AD-A169 690


ADVANCED  TELEPROCESSING  SYSTEMS 
DEFENSE  ADVANCED  RESEARCH  PROJECTS  AGENCY 

SEMI-ANNUAL  TECHNICAL  REPORT 

March  31,  1986 




Principal Investigator: Leonard Kleinrock


Computer  Science  Department 
School  of  Engineering  and  Applied  Science 
University  of  California 
Los  Angeles 


DISTRIBUTION STATEMENT A

Approved for public release;
Distribution Unlimited




REPORT DOCUMENTATION PAGE

TITLE (and Subtitle): Advanced Teleprocessing Systems;
Semi-Annual Technical Report

TYPE OF REPORT & PERIOD COVERED: Semi-Annual Technical,
10/1/85 - 3/31/86

AUTHOR(S): Leonard Kleinrock

CONTRACT OR GRANT NUMBER(S): MDA 903-82-C-0064

PERFORMING ORGANIZATION NAME AND ADDRESS:
School of Engineering & Applied Science
University of California, Los Angeles
Los Angeles, CA 90024

PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS:
DARPA Order No. 2496

CONTROLLING OFFICE NAME AND ADDRESS:
Defense Advanced Research Projects Agency
1400 Wilson Blvd
Arlington, VA 22209

REPORT DATE: March 31, 1986

NUMBER OF PAGES: 50

DISTRIBUTION STATEMENT (of this Report):
Approved for Public Release; Distribution Unlimited.

KEY WORDS:
Random access communications, computer networks, distributed
processing, distributed algorithms for election and traversal
in networks, parallel processing systems.

ABSTRACT:
This semi-annual technical report covers research carried out by
the Advanced Teleprocessing Systems Group at UCLA under DARPA
Contract No. MDA 903-82-C-0064 covering the period from
October 1, 1985 to March 31, 1986.


This Semi-Annual Technical Report covers research carried out by the
Advanced Teleprocessing Systems Group at UCLA under DARPA Contract
No. MDA 903-82-C-0064 covering the period from October 1, 1985
to March 31, 1986.

In this six-month period, six papers were published in the professional
literature. These papers were in the fields of computer networks,
multiaccess communications, and distributed processing. Our main focus is
currently in the direction of distributed systems performance; we
reproduce two papers in that area as the main body of this report. These
papers are "Distributed Systems," by Leonard Kleinrock, and "Broadcast
Communications and Distributed Algorithms," by Rina Dechter and Leonard
Kleinrock.




ADVANCED  TELEPROCESSING  SYSTEMS 
Semi-Annual  Technical  Report 
March  31,  1986 


Contract  Number:  MDA  903-82-C-0064 
DARPA  Order  Number:  2496 
Contract Period: February 1, 1984 to June 30, 1986
Report Period: October 1, 1985 to March 31, 1986

Principal Investigator: Leonard Kleinrock

Co-Principal Investigator: Mario Gerla

(213)  825-2543 


Computer  Science  Department 
School  of  Engineering  and  Applied  Science 
University  of  California,  Los  Angeles 


Sponsored  by 

DEFENSE ADVANCED RESEARCH PROJECTS AGENCY


The views and conclusions contained in this document are those of the authors and should not be interpreted as
necessarily representing the official policies, either expressed or implied, of the Defense Advanced Research
Projects Agency or the United States Government.




ADVANCED  TELEPROCESSING  SYSTEMS 


Defense  Advanced  Research  Projects  Agency 
March 31, 1986


INTRODUCTION 

This Semi-Annual Technical Report covers research carried out by the Advanced Teleprocessing
Systems Group at UCLA under DARPA Contract No. MDA 903-82-C-0064 covering the period
from October 1, 1985 to March 31, 1986. Under this contract we have three designated tasks as
follows:


TASK I. DISTRIBUTED COMMUNICATIONS ACCESS

The general problem of sharing a multi-access broadcast distributed
system among a set of competing users will be studied. General issues
involving exhaustive communications, start-up problems and refined
models to manifest some more realistic phenomena in these systems will
be studied. Applications to packet radio systems and large survivable
networks involving the study of tandem networks, multi-hop networks,
one-way communication links, correct reception of more than one
simultaneous transmission and mobility will be included. Further
applications will include the study of very high bandwidth channels
and/or very long propagation delay systems, multiple token systems and
compound hierarchical network structures.


TASK II. DISTRIBUTED PROCESSING

The interplay between distributed communications in a broadcast
environment and processing of distributed data will be studied. For
example, the effect of merging sorted lists in a broadcast environment,
as well as finding properties of elements in these lists, will be studied.
Concurrency in multiprocessor systems will be studied in order to
investigate performance in terms of response time and speedup factors for
various graph models of computation. Connection architectures for
multiprocessor systems will be investigated as well. One application here
is the structure of the processing and communication architecture for
supercomputers.


TASK III. DISTRIBUTED CONTROL AND ALGORITHMS


Routing,  flow  control  and  survivability  in  large  packet  radio  networks  as 
well  as  in  public  data  networks  will  be  studied  as  control  algorithms  in  a 
distributed  environment.  Measures  of  performance,  including 
throughput,  response  time,  blocking,  power,  fairness,  and  robustness  will 
be  applied  to  these  systems.  Distributed  algorithms  for  finding  shortest 
paths,  connectivity,  loops,  etc.  will  be  studied.  The  effect  of  node  and 
link  failures,  limited  amounts  of  memory  at  each  node  and  restricted 
channel  capacity  for  communications  will  be  investigated.  The  effect  of 
network  failures  and  delays  on  distributed  data  base  management  systems 
will  also  be  studied. 


In this six-month period, six papers were published in the professional literature. These papers
were in the fields of computer networks, multiaccess communications, and distributed processing.
Our main focus is currently in the direction of distributed systems performance; we reproduce two
papers in that area as the main body of this report. These papers are "Distributed Systems," by
Leonard Kleinrock, and "Broadcast Communications and Distributed Algorithms," by Rina
Dechter and Leonard Kleinrock.


RESEARCH  PUBLICATIONS 


1. Kleinrock, L., "Distributed Systems," invited paper for ACM/IEEE-CS Joint Special
Issue (November 1985): Communications of the ACM, Vol. 28, No. 11, pp. 1200-1213,
and Computer, Vol. 18, No. 11, pp. 90-103.

Growth of distributed systems has attained unstoppable momentum. If
we better understood how to think about, analyze, and design distributed
systems, we could direct their implementation with more confidence.

2. Rodrigues, P., Fratta, L., and M. Gerla, "Tokenless Protocols for Fiber Optic Local
Area Networks," IEEE Journal on Selected Areas in Communications, Vol. SAC-3, No.
6, November 1985, pp. 928-940.

A family of LAN (Local Area Network) protocols is presented. The
LAN consists of a pair of unidirectional fiber optic buses to which
stations are connected via passive taps. The protocols provide round-robin
bounded delay access to all stations. Contrary to most round-robin
access schemes, the protocols do not require transmission of special packets
(tokens); rather, they simply rely on the detection of bus activity at each
station. The performance of these protocols in various traffic conditions
and system configurations is evaluated via analysis and simulation.


3. Dechter, R. and L. Kleinrock, "Broadcast Communications and Distributed
Algorithms," IEEE Transactions on Computers, Vol. C-35, No. 3, March 1986, pp. 210-219.

The paper addresses ways in which one can use "broadcast
communication" in distributed algorithms and the relevant issues of design
and complexity. We present an algorithm for merging k sorted lists of n/k
elements using k processors and prove its worst case complexity to be 2n,
regardless of the number of processors, while neglecting the cost arising
from possible conflicts on the broadcast channel. We also show that this
algorithm is optimal under single-channel broadcast communications. In
a variation of the algorithm, we show that by using an extra local
memory of O(k) the number of broadcasts is reduced to n. When the
algorithm is used for sorting n elements with k processors, where each
processor first sorts its own list and the lists are then merged, it has a
complexity of O((n/k) log(n/k) + n), and is thus asymptotically optimal
for large n. We also discuss the cost incurred by the channel access scheme
and prove that resolving conflicts whenever k processors are involved
introduces a cost factor of at least log k.
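
The sort-then-merge scheme described in this abstract can be sketched sequentially in Python. This is only an illustration of the two phases behind the O((n/k) log(n/k) + n) cost, not the paper's broadcast algorithm: the function name `sort_then_merge`, the round-robin partitioning, and the use of the standard library's `heapq.merge` as a stand-in for the merge phase are all assumptions, and neither the broadcast channel nor conflict resolution is modeled.

```python
import heapq


def sort_then_merge(elements, k):
    """Sort `elements` by simulating k processors that each sort a local list.

    Hypothetical helper for illustration only; it runs sequentially and does
    not model the broadcast channel discussed in the paper.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Deal the n elements round-robin so each "processor" holds about n/k.
    local_lists = [list(elements[i::k]) for i in range(k)]
    # Phase 1: every processor sorts its own list, O((n/k) log(n/k)) each.
    for local in local_lists:
        local.sort()
    # Phase 2: merge the k sorted lists into one sorted output of length n.
    return list(heapq.merge(*local_lists))


data = [9, 4, 7, 1, 8, 2, 6, 3, 5, 0]
print(sort_then_merge(data, k=3))
```

In a real parallel setting phase 1 would run concurrently on the k processors, which is where the n/k term comes from; here the lists are simply sorted one after another.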


DISTRIBUTED  SYSTEMS 


Growth  of  distributed  systems  has  attained  unstoppable  momentum.  If  we 
better  understood  how  to  think  about,  analyze,  and  design  distributed 
systems,  we  could  direct  their  implementation  with  more  confidence. 


LEONARD  KLEINROCK 


DISTRIBUTED  SYSTEMS  IN  NATURE 
How did the killer bees find their way up to North
America? By what mechanism does a colony of ants
carry out its complex tasks? What guides and controls a
flock of birds or a school of fish? The answers to these
questions involve examples of loosely coupled systems
that achieve a common goal with distributed control.

Throughout nature we find an enormous amount of
processing taking place at the level of the individual
organism (be it an ant, a sparrow, or a human), and we
have only begun to comprehend how processing and
memory functions operate, especially in the human
species. How does a human perform the acts of perception,
cognition, decision making, and motor control?
This processing occurs in a fraction of a second, using
natural processing elements that are orders of magnitude
slower than our current computer processing elements [8].

We do know that the brain is organized and structured
very differently from our present computing machines.
In human beings (i.e., in their internal neural
systems) and in groups of organisms, nature has been
extremely successful in implementing distributed systems
that are far more clever and impressive than any
computing machine humans have yet devised. We have
succeeded in manufacturing highly complex devices
capable of high-speed computation and massive accurate
memory, but we have not yet gained sufficient
understanding of distributed systems; our systems are
still highly constrained and rigid in their construction
and behavior. The gap between natural and man-made
systems is huge, and we have a long way to go before
we bridge the gap in understanding and implementation
(see Figure 1, pp. 1202-1203).

This research was supported by the Defense Advanced Research Projects
Agency of the Department of Defense under Contract MDA 903-82-C-0064.

WHY  SHOULD  WE  STUDY 
DISTRIBUTED  SYSTEMS? 

Currently we are experiencing the effects of the confluence
of powerful forces in information technology.
By far, the most significant effect is the host of revolutionary
changes that have been brought about by the
integrated chip (especially in the form of VLSI) and the
resulting enormous improvements in processing,
storage, and communications. At the same time, we
are experiencing a frightening backlog in software-application
development while the user community is
clamoring for unprecedented power in processing, communications,
storage, and applications. Fortunately, we
have the potential for this power, if only we could
figure out how to put all the pieces together!

Distributed systems have come into existence in our
industrial society in some very natural ways. For example,
we have seen the emergence of a large number of
distributed databases: systems that have evolved because
the source of the data is not centralized and
where there is a local need for frequent and immediate
access to the locally generated data (e.g., the employee
database at a branch office of a nationwide organization)
in addition to a global need to view the entire
database. Situations such as these require us to place
some processing power at the many distributed locations
for collecting, preprocessing, and accessing data.
On-line transaction processing is an application that
may contain a local component as well as a distributed-processing
component, and the current proliferation
of desktop personal computers is a manifestation of
distributed-processing power. Indeed, if we measure
processing power in MIPS (millions of instructions per
second), we note that the number of installed MIPS in
personal computers is an order of magnitude greater
than the number installed in mainframes. However,
most of those PC MIPS lie idle most of the time. Imagine
what a terrific distributed-processing system we
could fire up with that unused power! When data and
processing are distributed, we are obliged to provide
communications to link the resources. Thus we are led
into the use of packet networks, satellite networks,
internets, cellular and packet radio networks, metropolitan
area networks, and local area networks.

Distributed systems can provide the necessary power
to meet the growing demands of the user community.
We are demanding capability faster than the advances
in devices alone can supply, and to meet these demands
we will have to rely on innovative computing
architectures such as parallel-processing systems. These
large distributed databases, along with distributed-processing
and distributed-communication networks,
have given rise to some very complex distributed-system
structures, and it is essential that we learn how to
think about them properly (see Figure 2, pp. 1204-1205).

ARCHITECTURE  AND  ALGORITHMS 
The world of applications has an insatiable need for
computing power. A good mathematician can easily
consume any finite computing capability by posing a
combinatoric problem whose computational complexity
grows exponentially with a variable of the problem
(e.g., the enumeration of all graphs with N nodes). The
ways in which we push back this "power wall" involve
both hardware and software solutions. Typically, the
methods for speeding up the computation include the
following:

•  faster devices (a physics and engineering problem),

•  architectures that permit concurrent processing
(a system design problem),

•  optimizing compilers for detecting concurrency
(a software-engineering problem),

•  algorithms for specification of concurrency (a language
problem), and

•  more expressive models of computation (an analytic
problem).

Characterizing  the  Architecture 

There  are  many  ways  of  classifying  machine  architec¬ 
tures — too  many,  in  fact.  The  following  classification 
was  selected  for  the  purposes  of  this  article. 

We begin with the purely serial uniprocessor in
which a single instruction stream operates on a single
data stream (SISD). These systems are "centralized" at
the global level, but really do contain many elements of
a distributed system at the lower levels, for example, at
the level of communications on the VLSI chips themselves.

Next is the vector machine, in which a single instruction
stream operates on a multiple data stream
(SIMD). These include array processors (e.g., systolic
arrays) and pipeline processors.

The third consists of multiple processors that, collectively,
can process multiple instruction streams on multiple
data streams (MIMD). The form of multiprocessing
that takes place when multiple processors cooperate
closely to process tasks from the same job is referred to
as parallel processing. On the other hand, the term distributed
processing is applied to the form of multiprocessing
that takes place when the multiple processors
cooperate loosely and process separate jobs.

Vector machines and multiprocessing systems all provide
some form of concurrency. The effect of this concurrency
on system performance is important and is
therefore a very active area of research (see Figure 3,
p. 1206).

Since the onslaught of the VLSI revolution, a number
of machine architectures have been implemented in an
attempt to provide the supercomputing power toward
which concurrent processing tempts us [5]. Two excellent
recent summaries of some of these projects are
offered by Hwang [9] and by Schneck et al. [17]. There
you will find the Butterfly machine, the Cosmic Cube,
various kinds of tree machines, the Cedar project, the
Sisal language, the Connection machine, and others
whose names are intriguingly close to Mother Nature's
systems.

Characterizing the Algorithm

The major goal in characterizing the algorithm is to
identify and exploit its inherent parallelism (i.e., potential
for concurrency). The levels of resolution at which
we can attempt to find this parallelism are listed below
in decreasing order of granularity [16]:

•  job execution,

•  task execution,

•  process execution,

•  instruction execution,

•  register transfer, and

•  logic device.

Clearly, as we drop down the list to finer granularity,
we expose more and more parallelism, but we also increase
the complexity of scheduling these tiny objects
to the processors and of providing the communications
among so many objects (the problem of interprocess
communication, or IPC). As was stated earlier, if we operate
at the top level (i.e., at the job level), then we
think of the system as a distributed-processing system;
if we operate at the task or process level, we have a
parallel-processing system; if we operate at the instruction
level, we have the vector machine and the array
processor.

Regardless of the level at which we operate, it behooves
us to create a "model" of the algorithm or, if you
will, of the computation we are processing [10]. A very
common model is the graph model of computation,
which is normally used at the task or process level
(another common modeling method is the use of Petri
Nets). In this model, the nodes represent the tasks (or
processes), and the directed edges represent the dependencies
among the tasks, thereby displaying the
partial ordering of the tasks and the parallelism that
can be exploited (see Figure 4, p. 1207).
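
A minimal Python sketch can make the graph model concrete. It assumes a job is given as a list of task names plus (before, after) dependency pairs, and it groups the tasks into "levels" such that all tasks in one level have their predecessors in earlier levels and could therefore run concurrently. The function name `parallel_levels` and the four-task example are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def parallel_levels(tasks, edges):
    """Group the tasks of a dependency graph into concurrently runnable levels.

    `tasks` is an iterable of task names and `edges` a list of
    (before, after) pairs. Representation and name are illustrative
    assumptions, not taken from the article.
    """
    successors = defaultdict(list)
    unmet = {t: 0 for t in tasks}      # unfinished predecessors per task
    for before, after in edges:
        successors[before].append(after)
        unmet[after] += 1

    levels = []
    ready = sorted(t for t, count in unmet.items() if count == 0)
    placed = 0
    while ready:
        levels.append(ready)
        placed += len(ready)
        next_ready = []
        for task in ready:
            for succ in successors[task]:
                unmet[succ] -= 1
                if unmet[succ] == 0:   # all predecessors are now placed
                    next_ready.append(succ)
        ready = sorted(next_ready)
    if placed != len(unmet):
        raise ValueError("dependency graph contains a cycle")
    return levels


# Hypothetical job: A precedes B and C, which both precede D.
example_levels = parallel_levels(
    "ABCD", [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
)
print(example_levels)  # [['A'], ['B', 'C'], ['D']]
```

The number of levels is the length of the longest dependency chain, and the width of each level shows how much parallelism the partial ordering exposes at that point.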

However, the problem of finding the parallelism in
the lines of code that represent the algorithm still remains,
and there is an ongoing effort to simplify (and
even automate) this task by developing parallel programming
languages for implementing these algorithms
(e.g., Ada, Concurrent Pascal).

Matching the Architecture to the Algorithm

The performance of a distributed system depends
strongly on how well the architecture and the algorithm
are matched. For example, a highly parallel algorithm
will perform well on a highly parallel architecture;
a distributed system requiring lots of interprocessor
communication will perform poorly if the communication
bandwidth is too narrow. This matching problem
becomes fierce and crucial when we attempt to
coordinate an exponentially growing number of processors
requiring an exponentially growing amount of
interprocessor communication. The apparent solution
to such an unmanageable problem is one that is
self-organizing.

If we choose to use the graph model discussed, we
are faced with a number of architecture/algorithm
problems, namely, partitioning, scheduling, memory access,
interprocess communication, and synchronization.
The partitioning problem refers to decisions regarding
the level of granularity and the choices involving
which objects should be grouped into the same node of
the task graph. The scheduling problem refers to the
assignment of processors and memory modules to nodes
of the computation graph. In general, this is an
NP-complete problem (tough as nails to do optimally). The
memory-access problem refers to the mechanism that
allows processors to communicate with the various
memory modules; usually, either shared-memory or
message-passing schemes are used. The interprocessor-communication
problem refers to the nature of the
communication paths and connections that are available
to provide processors access to the memory modules
and to other processors; this may take the form of
an interconnection network in a parallel-processing
system, a local area network in a local distributed-processing
system or shared data system or shared
peripheral system, or a packet-switched, value-added,
long-haul network in a nationwide distributed system.
Synchronization refers to the requirement that no node
in the graph model can begin execution until all of its
predecessor nodes have completed their execution.
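
The scheduling and synchronization problems can be illustrated with a small simulation. The Python sketch below assigns ready tasks to free processors greedily ("list scheduling") and releases a task only after all of its predecessors have finished. The greedy policy, the function name `list_schedule`, and the example durations are assumptions made for illustration; since optimal scheduling is NP-complete in general, this heuristic is not guaranteed to find the shortest schedule.

```python
import heapq


def list_schedule(durations, edges, processors):
    """Greedy list scheduling of a task graph; returns (makespan, start_times).

    `durations` maps task -> execution time and `edges` lists
    (before, after) pairs. Hypothetical helper, not from the article.
    """
    if processors < 1:
        raise ValueError("need at least one processor")
    successors = {t: [] for t in durations}
    unmet = {t: 0 for t in durations}
    for before, after in edges:
        successors[before].append(after)
        unmet[after] += 1

    ready = sorted(t for t in durations if unmet[t] == 0)
    running = []                      # min-heap of (finish_time, task)
    start_times = {}
    now = 0.0
    while ready or running:
        # Start ready tasks while processors are free.
        while ready and len(running) < processors:
            task = ready.pop(0)
            start_times[task] = now
            heapq.heappush(running, (now + durations[task], task))
        # Advance the clock to the next completion.
        now, finished = heapq.heappop(running)
        # Synchronization: a successor is released only when every
        # one of its predecessors has completed.
        for succ in successors[finished]:
            unmet[succ] -= 1
            if unmet[succ] == 0:
                ready.append(succ)
    if len(start_times) != len(durations):
        raise ValueError("dependency graph contains a cycle")
    return now, start_times


# Hypothetical job: A precedes B and C, which both precede D.
job = {"A": 1, "B": 2, "C": 3, "D": 1}
deps = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
one_cpu, _ = list_schedule(job, deps, processors=1)
two_cpu, starts = list_schedule(job, deps, processors=2)
print(one_cpu, two_cpu, one_cpu / two_cpu)
```

For this example the ratio of the one-processor time to the two-processor time is below 2, because task D must wait for both B and C to finish.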

The use of broadcast or multicast communication
opens up a number of interesting alternatives for communication.
Local area networks take exquisite advantage
of these communication modes. Algorithms that
require tight coupling (i.e., lots of IPC) need not only
large bandwidths (which, for example, could be provided
by fiber-optic channels), but also low latency.
Specifically, the speed of light introduces a 15,000-microsecond
latency delay for a communication that
must travel from coast-to-coast across the United States.
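
The coast-to-coast latency figure can be checked with a one-line calculation. The sketch below assumes a path length of roughly 4,500 km and propagation at the vacuum speed of light; neither number is given in the article, and signals in real cables or fibers travel more slowly, so this is a lower bound under those assumptions.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # speed of light in vacuum


def propagation_delay_us(distance_km, speed_m_per_s=SPEED_OF_LIGHT_M_PER_S):
    """One-way propagation delay in microseconds over `distance_km`."""
    return distance_km * 1_000 / speed_m_per_s * 1e6


# Assumed coast-to-coast path of about 4,500 km (not stated in the article).
coast_to_coast_us = propagation_delay_us(4_500)
print(round(coast_to_coast_us))
```

Under these assumptions the delay comes out at about 15,000 microseconds, the figure quoted in the text, and no increase in bandwidth can reduce it.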
There is an amazing contrast between the neural structure of
the human brain (a) and the architecture of today's VLSI
chips (b). The brain is massively parallel, densely (and
weirdly) connected with leaky transmission paths, highly fault
tolerant, self-repairing, adaptive, noisy, and probably
nondeterministic. Man-made computers are highly constrained,
precisely (and often symmetrically) laid out with carefully
isolated wires, not very fault tolerant, largely serial and
centralized, deterministic, minimally adaptive, and hardly
self-repairing. (Photo (a) is courtesy of Peter Arnold, Inc.)

FIGURE 1. Natural and Man-Made Architectures

Another consideration in matching architectures to
algorithms is the balance and trade-off among communication,
processing, and storage. We have all seen systems
where one of these resources can be exchanged
for others. For example, if we do some preprocessing in
the form of data compression prior to transmission, we
can cut down on the communication load (trade processing
for communication). If we store a list of computational
results, we can cut down on the need to recompute
the elements of the list each time we need the
same entry (trade storage for processing). Similarly, if
we store data from a previous communication, we need
merely transmit the data address or name of the previous
message rather than the message itself (trade storage
for communication). Selecting the appropriate mix
in a given problem setting is an important issue.
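
Two of these trade-offs can be sketched with the Python standard library. The example below uses `functools.lru_cache` to store computed results (storage for processing) and `zlib.compress` to shrink a message before it is sent (processing for communication). The function names `expensive_entry` and `compress_for_link` and the sample payload are hypothetical; the third trade-off, sending the name of a previously stored message, is not shown.

```python
import zlib
from functools import lru_cache


@lru_cache(maxsize=None)
def expensive_entry(i):
    """Stand-in for a costly computation; results are stored after first use."""
    return sum(j * j for j in range(i))


def compress_for_link(message: bytes) -> bytes:
    """Spend processing before transmission to reduce the bytes on the link."""
    return zlib.compress(message)


# Trade storage for processing: the second call is answered from the cache.
expensive_entry(10_000)
expensive_entry(10_000)
print(expensive_entry.cache_info().hits)

# Trade processing for communication: a repetitive payload shrinks a lot.
payload = b"distributed systems " * 200
sent = compress_for_link(payload)
print(len(payload), len(sent))
```

How much either trade saves depends on the workload: caching helps only when entries are requested repeatedly, and compression helps only when the data is redundant.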

Distributed algorithms operating in a distributed network
environment (e.g., a packet-switched network)
pose the possibility that network failures may cause the
network to temporarily be partitioned into two (or
more) isolated subnetworks. In such a case, detection
and recovery mechanisms must be introduced (see Figure 5,
p. 1207).

Lastly, it should be mentioned that very little is
known about characterizing those properties of an algorithm
that cause it to perform well or poorly in a distributed
environment.

PERFORMANCE  AND  BEHAVIOR 
We do know some things about the way distributed
systems behave, precious few though they may be. The
most interesting thing about them is that they come to
us from research in very different fields of study.
Unfortunately, the collection of results (of which the following
is a sample) is just that: a collection, with no
fundamental models or theory behind it.

FIGURE 2. A Complex Distributed System

Humans have created some unbelievably complex distributed
systems. The fact that they work at all is amazing, given that
we have not yet uncovered the basic principles determining
their behavior. (From Martin, J. Design and Strategy for
Distributed Data Processing. Prentice-Hall, Englewood Cliffs,
N.J., 1981.)

We begin by considering closely coupled systems, in
particular, parallel-processing systems. One of the most
compelling applications of parallel processing is in the
area of scientific computing, where the speed of the
world's largest uniprocessors is hopelessly inadequate
to handle the computational complexity required for
many of these problems [3]. Of course, the idea is that,
as we apply more parallel processors to the computational
job, the time to complete that job will drop in
proportion to the number of (identical) processors, P.
The "speedup" factor, denoted by S, is a common measure
of performance for parallel-processing systems
and is defined as the time required to complete the job
using one of these processors divided by the time required
to complete the job using P processors. S may
also be interpreted as the average number of busy processors,
that is, the concurrency. The best we can
achieve is for S to grow directly with P; that is,

$$S \le P.$$

Thus, in general, $1 \le S \le P$. In the early days of parallel
processing, Minsky [15] conjectured a depressingly pessimistic
form for the typical speedup; namely,

$$S \approx \log P.$$

Often that kind of poor performance is indeed observed.
Fortunately, however, experience has shown
that things need not be that bad. For example, we can
achieve $S \approx 0.3P$ for certain programs by carefully
extracting the parallelism in Fortran DO loops [14]. However,
Amdahl has pointed out a serious limitation to
the practical improvements one can achieve with parallel
processing (the same argument applies to the improvements
available with vector machines) [1]. He argues
that, if a fraction, f, of a computation must be done
serially, then the fastest that S can grow with P is

$$S_{\max} = \frac{P}{fP + 1 - f}.$$

We see that, for $f = 1$ (everything must be serial),
$S_{\max} = 1$; for $f = 0$ (everything in parallel),
$S_{\max} = P$.
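
Amdahl's bound is easy to explore numerically. The Python sketch below evaluates the standard form of the bound, P / (fP + 1 - f), where f is the serial fraction and P the number of processors; the function name `amdahl_speedup` and the sample values of f and P are chosen here purely for illustration.

```python
def amdahl_speedup(f, p):
    """Upper bound on speedup with serial fraction f and p processors."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("serial fraction f must lie in [0, 1]")
    if p < 1:
        raise ValueError("need at least one processor")
    return p / (f * p + 1.0 - f)


# Tabulate the bound for a few illustrative serial fractions.
for f in (0.0, 0.1, 0.5, 1.0):
    row = [round(amdahl_speedup(f, p), 2) for p in (1, 10, 100, 1000)]
    print(f, row)
```

For any f greater than zero the bound approaches 1/f as P grows, so even a small serial fraction caps the benefit of adding processors.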

The actual amount of parallelism (i.e., S) achieved in a
parallel-processing system is a quantity that we would like to be
able to compute. S is a strong function of the structure of the
computational graph of the jobs being processed. I, with one of
my students [2], have been able to calculate S exactly as a simple
function of the graph model. Specifically, we consider a
parallel-processing system with P processors and with an arrival
rate of $\lambda$ jobs per second. We assume the collection of jobs
can be modeled with an arbitrary computation graph with an average
of N tasks per job, each task requiring an average of $\bar{x}$
seconds. Then it can be shown that

$$S = \begin{cases} \lambda N \bar{x} & \text{for } \lambda N \bar{x} \le P \\ P & \text{for } \lambda N \bar{x} > P \end{cases}$$

This is a very general result; in some special cases, the
distribution of the number of busy processors can be found as well.
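The result above says that the achieved concurrency equals the offered work per second (arrival rate times tasks per job times mean task time) until it reaches the number of processors. A minimal sketch, with hypothetical parameter names and sample numbers:

```python
def mean_busy_processors(arrival_rate: float, tasks_per_job: float,
                         mean_task_time: float, processors: int) -> float:
    """Achieved concurrency S: the offered work, capped at the processor count."""
    offered_work = arrival_rate * tasks_per_job * mean_task_time
    return min(offered_work, float(processors))


if __name__ == "__main__":
    # 2 jobs/s, 10 tasks per job, 0.25 s per task, 8 processors.
    print(mean_busy_processors(2.0, 10, 0.25, 8))
    # Doubling the arrival rate pushes the offered work past the 8 processors.
    print(mean_busy_processors(4.0, 10, 0.25, 8))
```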

The graph model of computation is an extremely useful model for
displaying the parallelism inherent in an algorithm (i.e., a job).
The entire graph represents the computational tasks associated
with that job, the nodes represent the tasks themselves, and the
directed arcs, which define a partial ordering of the nodes,
represent the sequence in which the tasks must be performed.

FIGURE 4. Graph Model of Computation

So far, we have given ourselves the luxury of increasing the
system's computational capacity as we have added more processors
to the system. Let us now consider adding more processors, but in
a fashion that maintains a constant total system capacity (i.e., a
constant system throughput in jobs completed per second). This
will allow us to see the effect of distributing the computation
for a job over many smaller processors. The particular structure
we are considering is the regular series-parallel structure shown
in Figure 6 (p. 1208), where we have taken a total processing
capacity of C MIPS and divided it equally into mn processors, each
of C/mn MIPS. On entering the system, a job selects (equally
likely) any one of the m series branches down which it will
travel. It will receive 1/n of its total processing needs at each
of the n series-connected processors. The key result for this
system [13] is that the mean response time for jobs in this
series-parallel pipeline system is mn times as large as it would
have been had the jobs been processed by a single processor of C
MIPS! There are some statistical assumptions behind this result,
but the message is clear: distributed processing of this kind is
terrible. Why, then, is everyone talking about the advantages of
distributed processing? The answer must be that a large number of
small processors (e.g., microprocessors) with an aggregate
capacity of C MIPS is less expensive than a large uniprocessor of
the same total capacity. It can be shown that the series-parallel
system will have the same response time as the uniprocessor if the
aggregate capacity of the series-parallel system is K times the
capacity of the uniprocessor, where

$$K = mn - \rho(mn - 1)$$

and where $\rho$ is the utilization factor for each processor;
namely, $\rho$ = arrival rate of jobs times the average service
time per job for a processor. This says that, for light loads
($\rho \ll 1$), $K \approx mn$, whereas, for heavy loads
($\rho \to 1$), $K = 1$. Is it the case that smaller machines are
mn times less expensive than larger machines (so that we can
purchase mn times the capacity at the same total price, as is
needed in the light-load case)? To answer this question, recall a
law that was empirically observed by Grosch more than three
decades ago. Grosch's law [7] states that the capacity of a
computer is related to its cost, which we denote by D (dollars),
through the following equation:

$$C = J D^2$$

where J is a constant. This law may be rewritten as

$$\frac{D}{C} = \frac{1}{\sqrt{JC}}.$$
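To make the break-even question concrete, the sketch below combines the capacity factor K = mn - rho(mn - 1) with Grosch's law (capacity equals a constant times cost squared). Pricing every small processor by Grosch's law is an assumption made purely for illustration, as are the function names and the choice of unit constants.

```python
import math


def capacity_factor(m: int, n: int, rho: float) -> float:
    """K: how much more aggregate capacity the series-parallel system needs."""
    return m * n - rho * (m * n - 1)


def grosch_cost(capacity: float, j: float = 1.0) -> float:
    """Cost D of one machine under Grosch's law, C = J * D**2."""
    return math.sqrt(capacity / j)


def cost_ratio(m: int, n: int, rho: float, c: float = 1.0, j: float = 1.0) -> float:
    """Cost of mn small machines (aggregate capacity K*C) over one machine of capacity C."""
    k = capacity_factor(m, n, rho)
    per_machine_capacity = k * c / (m * n)
    distributed_cost = m * n * grosch_cost(per_machine_capacity, j)
    return distributed_cost / grosch_cost(c, j)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9, 1.0):
        print(rho, capacity_factor(2, 2, rho), round(cost_ratio(2, 2, rho), 3))
```

Under these assumptions the ratio works out to the square root of mn times K, so the distributed system costs more at every load level, which is the sense in which Grosch's law works against distribution.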

Grosch tells us that the economics are exactly the reverse of what
we need to break even with distributed processing! He says that
larger machines are cheaper per MIPS. If Grosch is correct today,
then why are microprocessors selling like hotcakes? A more recent
look at the economics explains why. Ein-Dor [4] shows that, if we
consider all computers at the same time, Grosch's law is clearly
not true, as seen in Figure 7 (p. 1208). In this figure we see that
microcomputers are a good buy. However, as Ein-Dor points out,
Grosch's law is still true today if we consider families of
computers. Each family has a decreasing cost per unit of capacity
as capacity is increased. Ein-Dor goes on to make the observation
that, if one needs a certain number of MIPS, then one should
purchase computers from the smallest family that can currently
supply that many MIPS. Furthermore, once in the family, it pays to
purchase the biggest member machine in that family (as predicted by
Grosch).

Network failures can create two separated subnetworks that cannot
communicate until the failure is repaired. Maintaining consistency
of databases in such a situation is a key issue in
distributed-systems design.

FIGURE 5. A Partitioned Network

Now that we have discussed the performance of parallel-processing
systems for some special cases, let us generalize the ways in which
jobs pass through a multiprocessor system, and analyze the system
throughput and response time. Indeed, we bound these key
system-performance measures in the following way: Suppose we have a
population of M customers competing for the resources of the
system. Assume that customers generate jobs to be processed by some
of the system's resources, that the way in which these jobs bounce
around among the resources is specified in a probabilistic fashion,
and that the mean response time of this system is T seconds. When a
customer's job leaves the system, that customer then begins to
generate another job request for the system, where the average time
to generate the request is $t_0$ seconds. Of interest is the mean
response time, T, and the system throughput, $\gamma$, as a function
of the other system parameters. Although we have been extremely
general in the system description, we can nevertheless place an
excellent upper bound on the system throughput and an excellent
lower bound on the mean response time, as shown in Figure 8. In
this figure, the quantity $M^*$ is defined as the ratio of the mean
cycle time $T_0 + t_0$ to the mean time $x_0$ required on the
critical resource in a cycle; $T_0$ is the mean response time when
M = 1, and the critical resource is that system resource that is
most heavily loaded [11].
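The figure itself is not reproduced here, so the sketch below assumes that the bounds in question are the standard asymptotic bounds for a closed system: throughput cannot exceed either M divided by the uncontended cycle time or the rate at which the critical resource can serve, and the response-time bound follows from the same two limits. The function and argument names are hypothetical.

```python
def saturation_number(resp_time_one_user: float, think_time: float,
                      critical_time: float) -> float:
    """M*: mean cycle time with one user over the time needed on the critical resource."""
    return (resp_time_one_user + think_time) / critical_time


def throughput_upper_bound(m: int, resp_time_one_user: float,
                           think_time: float, critical_time: float) -> float:
    """Jobs per second can exceed neither the no-contention rate nor the critical-resource rate."""
    return min(m / (resp_time_one_user + think_time), 1.0 / critical_time)


def response_time_lower_bound(m: int, resp_time_one_user: float,
                              think_time: float, critical_time: float) -> float:
    """Mean response time is at least the one-user value and at least M*x0 - t0."""
    return max(resp_time_one_user, m * critical_time - think_time)


if __name__ == "__main__":
    # Illustrative numbers: 2 s response with one user, 8 s think time, 0.5 s on the critical resource.
    print(saturation_number(2.0, 8.0, 0.5))
    for users in (10, 20, 40):
        print(users,
              throughput_upper_bound(users, 2.0, 8.0, 0.5),
              response_time_lower_bound(users, 2.0, 8.0, 0.5))
```

Under this assumption the two throughput limits cross exactly at M users equal to M*, which is why that ratio marks the knee of both curves.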

To find the exact behavior (shown in dashed lines in the figure)
rather than the bounds, one must be much more explicit about the
distributions of the service time required by jobs at each resource
in the system as well as the queueing discipline at each. Using the
bounds or the exact results, the effect of parameter changes on the
system behavior can be seen. For example, one can examine the
accuracy of the common rule of thumb that suggests that the proper
mix of microprocessor speed, memory size, and communication
bandwidth is in the proportion 1 MIPS, 1 Mbyte, and 1 Mbit per
second; some suggest that we will soon see a 10, 10, 10 mix instead
of the 1, 1, 1 mix. Of course, the correct answer to this question
depends on the total system configuration.

Once we evaluate the throughput and mean response time for a
system, we usually want to find the relationship between the two,
which typically has the well-known shape (shown in Figure 9,
p. 1210) that clearly demonstrates the trade-off between them: a
low delay implies a small throughput and vice versa.

Excellent bounds on throughput (a) and mean response time (b) as a
function of the number of users (or any measure of the input load)
are easily obtained for a very large class of distributed systems.
The exact behavior can be derived for more restricted systems and
demonstrates the excellence of the bounds.

FIGURE 8. Bounds on Throughput and Response Time: (a) Bound on
Throughput; (b) Bound on Mean Response Time. Horizontal axis: input
load (M), with M* marked.


We are immediately compelled to inquire about the location of the
"optimal" operating point for a system. The answer depends on how
much you hate delay versus how much you love throughput. One way to
quantify this love-hate choice is to define a quantity known as
"power" (denoted by P), which is defined as the ratio of throughput
to delay:

$$P = \frac{\gamma}{T},$$

where $\gamma$ is the system throughput and T is the mean response
time.

The operating point that optimizes (i.e., maximizes) the power
(large throughput and small delay) is located at that throughput
where a straight line (of minimum slope) out of the origin touches
the throughput-delay profile (usually tangentially); such a tangent
and operating point are shown in Figure 9. This result holds for
all profiles and all flow-control functions (see below). Moreover,
for a large class of queueing curves, this optimal operating point
implies that the system should be loaded in such a way that each
resource has, on the average, exactly one job to work on [12].

The delay-throughput relationship, an example of the key profile in
systems performance evaluation, clearly shows the trade-off between
the two. In general, you cannot get a small delay and a large
throughput at the same time. We can, however, maximize "power,"
which is the ratio of throughput to delay, in order to define the
natural point for a system.

FIGURE 9. The Key System Profile
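The "exactly one job per resource" rule can be checked numerically for one member of that class of queueing curves. The sketch below assumes a single M/M/1 queue (an assumption, not something the text specifies), for which the mean delay is 1 / (mu - lambda), and searches a grid of arrival rates for the one that maximizes throughput divided by delay.

```python
def mm1_delay(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an M/M/1 queue (requires arrival_rate < service_rate)."""
    return 1.0 / (service_rate - arrival_rate)


def power(arrival_rate: float, service_rate: float) -> float:
    """Power = throughput / mean delay."""
    return arrival_rate / mm1_delay(arrival_rate, service_rate)


def best_operating_point(service_rate: float, steps: int = 1000):
    """Grid search for the arrival rate that maximizes power."""
    candidates = [service_rate * i / steps for i in range(1, steps)]
    best_rate = max(candidates, key=lambda lam: power(lam, service_rate))
    utilization = best_rate / service_rate
    mean_jobs_in_system = utilization / (1.0 - utilization)
    return best_rate, utilization, mean_jobs_in_system


if __name__ == "__main__":
    print(best_operating_point(1.0))
```

For this queue the search lands on a utilization of one half, where the mean number of jobs in the system is one, in line with the rule quoted from [12].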


Unfortunately,  there  are  some  distributed  systems 
that  do  not  have  the  nice  relationship  shown  in  Figure 
8a  where  the  throughput  rises  asymptotically  to  its 
maximum  value  as  the  “input"  is  increased.  Often  we 
find  the  behavior  depicted  in  Figure  10  where  the 
throughput  reaches  a  peak  and  then  declines  as  the 
input  increases  further,  possibly  dropping  to  zero,  in 
which  case  we  say  that  the  system  has  crashed.  Such 
behavior  has  been  observed  in  paged  virtual-memory 
systems  (thrashing),  in  computer  networks  (deadlocks 
and  degradations),  and  in  automobile  traffic  flow 
(bumper-to-bumper  traffic).  Here  again,  one  must  find  a 
method for controlling the input (i.e., setting the system
operating point) so as to achieve optimal or near-optimal
performance (somewhere near the peak of the curve in Figure 10).

"Flow control" is the name associated with this operation, and it
can be implemented in a centralized or a distributed fashion in
distributed systems, with the latter being the more challenging
design problem [6]. One example of distributed control is the
dynamic routing procedure found in many of today's packet-switching
networks where no single switching node is responsible for the
network routing. Instead, all nodes participate in the selection of
network routes in a distributed fashion. A great deal of research
is currently under way to evaluate the performance of other
distributed algorithms in networks and distributed systems.
Examples are the distributed election of a leader, distributed
rules for traversing all the links of a network, and distributed
rules for controlling access to a database.

Another large class of distributed-control algorithms has to do
with sharing a common communication channel among a number of
devices in a distributed fashion [19]. If the channel is a
broadcast channel (also known as a one-hop channel), then the
analytic and design problem is fairly manageable, and a number of
popular local area network algorithms for media access control have
been studied and implemented. Examples here include CSMA/CD
(carrier-sense multiple access with collision detect, as used in
Xerox's Ethernet, AT&T's 3B-Net and Starlan, and IBM's PC Network),
token passing (as used in the token-ring and token-bus networks),
and address contention resolution (as used in AT&T's ISN). A large
number of additional channel access algorithms have been studied in
the literature, including Expressnet, tree algorithms, urn models,
and hybrid models. If the channel is multicast (or multihop), then
the analytic problem becomes much harder.

There are many systems that degrade badly when pushed too hard.
They can even degrade to a situation of deadlock. Examples include
thrashing in virtual memory systems, deadlocks in computer
networks, and bumper-to-bumper traffic in highway systems.

FIGURE 10. A Dangerous Throughput Profile

But what if the processors in our distributed environment are
allowed to communicate with their peers in very limited ways? Can
we endow these processors (let us call them automatons for this
discussion) with an internal algorithm that will allow them to
achieve a collective goal? Tsetlin [20] studied this problem at
length and was able to demonstrate some remarkable behavior. For
example, he describes the Goore game in which the automatons
possess finite memory and act in a probabilistic fashion based on
their current state and the current input. They cannot communicate
with each other at all and are required to vote YES or NO at
certain times. The automatons are not aware of each other's vote;
however, there is a referee who can observe and calculate the
percentage, p, of automatons that vote YES. The referee has a
function, f(p) (such as that shown in Figure 11), where we require
that $0 \le f(p) \le 1$. Whenever the referee observes a
percentage, p, who vote YES, he or she will, with probability f(p),
reward each automaton, independently, with a one dollar payment;
with probability 1 - f(p) he or she will punish an automaton by
taking one dollar away. Tsetlin proved that no matter how many
players there may be in a Goore game, if the automatons have
sufficient memory, then for the payoff probability shown in the
figure, exactly 20 percent of the automatons will vote YES with
probability one! This is a beautiful demonstration of the ability
of a distributed-processing system to act in an optimum fashion,
even when the rules of the reward function are unknown to the
players and when they can neither observe nor communicate with each
other. All they are allowed is to vote when asked, and to observe
the reward or penalty they receive as a result of that vote. In
this work we see the beginnings of a theory that may be able to
explain how the colony of ants performs its tasks.
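The rules of the Goore game are simple enough to simulate. The sketch below is an illustration under stated assumptions, not Tsetlin's own construction: each player is modeled as a two-action automaton with a fixed number of memory levels per action, and `peaked_payoff` is a hypothetical stand-in for the payoff curve of Figure 11, peaking at 20 percent. A small, short run should not be expected to reproduce the probability-one result, which requires sufficient memory.

```python
import random


class TwoActionAutomaton:
    """Finite-memory automaton: a reward deepens commitment, a penalty erodes it."""

    def __init__(self, depth: int, vote_yes: bool):
        self.depth = depth        # number of memory levels per action
        self.vote_yes = vote_yes  # current action
        self.level = 1            # 1 = on the verge of switching

    def update(self, rewarded: bool) -> None:
        if rewarded:
            self.level = min(self.depth, self.level + 1)
        elif self.level > 1:
            self.level -= 1
        else:
            self.vote_yes = not self.vote_yes  # penalty at the boundary flips the vote


def peaked_payoff(p: float, peak: float = 0.2) -> float:
    """Hypothetical reward probability, largest when a fraction `peak` votes YES."""
    return 1.0 - abs(p - peak)


def play_goore(payoff, players: int = 50, depth: int = 8,
               rounds: int = 2000, seed: int = 0) -> float:
    """Run the game and return the final fraction of YES votes."""
    rng = random.Random(seed)
    automatons = [TwoActionAutomaton(depth, rng.random() < 0.5) for _ in range(players)]
    for _ in range(rounds):
        fraction_yes = sum(a.vote_yes for a in automatons) / players
        reward_probability = payoff(fraction_yes)  # the referee sees only the fraction
        for a in automatons:
            # Each automaton is rewarded or punished independently.
            a.update(rng.random() < reward_probability)
    return sum(a.vote_yes for a in automatons) / players


if __name__ == "__main__":
    print(play_goore(peaked_payoff))
```

Note that no automaton ever reads another's vote or the payoff function; the only coupling is through the reward each one receives.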

NEEDED UNDERSTANDING AND TOOLS
In the previous section, we discussed a few of the things known
about distributed-systems performance and behavior. A few isolated
facts are indeed known, but overall theory and understanding are
still lacking.

For instance, we need considerably sharper tools to evaluate the
ways in which randomness, noise, and inaccurate measurements affect
the performance of distributed systems. What is the effect of
distributed control in an environment where that control is
delayed, based on estimates, and not necessarily consistent
throughout the system? What is the effect on performance of scaling
some of the system parameters? We need a common metric for
discussing the various system resources of communications, storage,
and processing. For example, is there a processing component to
communications? We also need a proper way to discuss distributed
algorithms and distributed architectures.

A microscopic theory that deals with the interaction of each job
with each component of the system is likely to overwhelm us with
detail and will fail to lead us to an understanding of the overall
system behavior. It is similar to the futility of studying the
many-body problem in physics in order to obtain the global behavior
of solids. What is needed is a macroscopic theory of distributed
systems, such as thermodynamics has provided for the physicist. In
fact, Yemini [21] has proposed an approach for a macroscopic theory
based on statistical mechanics that will lead to better
understanding the global behavior of distributed systems without
the need for a detailed, fine-grained analysis.

The Goore game rewards each member of a set of automatons
independently with a probability given by the function f(p), where
p = the fraction of the set that votes YES at a given time. The
automatons are completely unaware of the other automatons, do not
know the function f(p), and, remarkably, will collectively vote in
a way that maximizes the payoff to all.

FIGURE 11. The Goore Game

Another fruitful approach that also avoids the horrible details of
any specific system structure must be credited to Shannon [18]. In
analyzing the behavior of error-correcting codes for noisy
communication channels, Shannon used the brilliant device of
studying all possible codes simultaneously. This enabled him to
average out the detailed structure of any given code. He could then
take exquisite advantage of the law of large numbers in order to
arrive at a precise statement regarding the error behavior of
codes. It is likely that such an approach will allow us to study
the behavior of "typical" topologies and algorithms in distributed
systems.

LIKELY FUTURE DEVELOPMENTS
These are exciting times. Researchers in universities and
laboratories around the world have begun to focus their attention
on distributed systems. They come to this field from diverse
disciplines, ranging from queueing theory to neuroanatomy, in which
they are the experts. Thus, we have the ingredients for an
enormously rich soup of separate ideas that have only just begun to
blend.

As the theoretical frontiers are being assaulted, so too are the
practitioners busily building systems. This is a double-edged
sword. On the one hand, the implementation of real distributed
systems in the hands of the designers and users provides us with a
strong motivation for progress in understanding, as well as a
magnificent test bed in which we can experiment. On the other hand,
these systems are massively expensive and are being implemented
without the benefit of the principles we seek. As a result, they
may be colossal failures! The reality is that there is no way we
can prevent their proliferation as manufacturers respond to the
frenzied demand from the user community. In a sense we are all
responsible for the current craziness, because we have been
"promising" these miraculous systems to the user for almost a
decade.

UNDERLYING PRINCIPLES OF DISTRIBUTED-SYSTEMS BEHAVIOR

•  Developing innovative architectures for parallel processing
•  Providing better languages and algorithms for specification of
   concurrency
•  More expressive models of computation
•  Matching the architecture to the algorithm
•  Understanding the trade-off among communication, processing,
   and storage
•  Evaluation of the speedup factor for classes of algorithms and
   architectures

In the face of these developments, we can foresee some of the
likely developments that will take place over the next decade or
two. Let us first consider the likely technology developments in
hardware-type resources. One of the most exciting of these is the
huge data bandwidth projected for fiber-optic technology. These
fibers are being used for point-to-point communication pipes at
rates on the order of hundreds of megabits per second. Bell
Laboratories and Japan have been leapfrogging each other in setting
world records for the largest data rates transmitted over the
longest distances. Earlier this year, Bell Laboratories established
a new record by transmitting at the rate of 4 billion bits per
second at a distance of 117 km without any repeaters! The product
of data rate times distance has been doubling every year since 1975
and, based on the limits imposed by physics, there are still five
orders of magnitude to go (16 years of doubling left). The tiny
glass fiber is so clear that, if the oceans of the world were made
of this glass, one could see the bottom of the deepest trench in
the ocean floor from the surface. If we consider a 1-mW laser and a
requirement of 10 photons to detect 1 bit of information
(high-quality detection), then a single strand of fiber should be
able to support a data bandwidth of $10^{15}$ bits per second. That
would provide, for example, a 100-Mbit-per-second channel to each
of 10 million users, all on one thin strand! This light-wave
technology is being installed across the United States right now.
The Los Angeles 1984 Olympics video was transmitted from the games'
remote locations to satellite transmitters using a fiber-optic
network installed by Pacific Bell, perhaps the most well-known
application to date. This technology is being applied to local area
networks by a number of vendors, but the technology for this
application is not yet mature because we have yet to develop an
efficient way to optically tap into the light pipes at low loss. As
soon as that problem is resolved (in the next two or three years),
we are likely to see a rapid deployment of fiber-optic channels in
our local network environment.
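The two back-of-the-envelope figures in this paragraph can be rechecked with a few lines of arithmetic. The sketch below assumes a wavelength of 1.3 micrometers, which the text does not state; the function names are likewise illustrative.

```python
import math

PLANCK = 6.62607015e-34     # Planck constant, J*s
LIGHT_SPEED = 2.99792458e8  # speed of light, m/s


def photon_limited_bit_rate(power_watts: float, photons_per_bit: float,
                            wavelength_m: float) -> float:
    """Bits per second if every bit costs a fixed number of photons."""
    photon_energy = PLANCK * LIGHT_SPEED / wavelength_m
    return power_watts / (photons_per_bit * photon_energy)


def doublings_needed(orders_of_magnitude: float) -> float:
    """Number of doublings required to gain the given factor of ten."""
    return orders_of_magnitude * math.log2(10)


if __name__ == "__main__":
    # 1 mW laser, 10 photons per bit, assumed 1.3 micrometer wavelength.
    print(f"{photon_limited_bit_rate(1e-3, 10, 1.3e-6):.2e} bits per second")
    # Five orders of magnitude at one doubling per year.
    print(f"{doublings_needed(5):.1f} doublings")
```

Both results are of the same order as the figures quoted in the text: on the order of 10^15 bits per second, and between 16 and 17 doublings.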


UNDERLYING PRINCIPLES OF DISTRIBUTED-SYSTEMS BEHAVIOR (continued)

•  Evaluation of the cost-effectiveness of distributed-processing
   networks
•  Study of distributed algorithms in networks
•  Investigation of how loosely coupled self-organizing automatons
   can demonstrate expedient behavior
•  Development of a macroscopic theory of distributed systems
•  Understanding how to average over algorithms, architectures, and
   topologies to provide meaningful measures of system performance


As discussed earlier, enormous bandwidths are necessary, but not
sufficient, for many tightly coupled systems. The latency
introduced due to propagation delay can inhibit tight control.
(E.g., if we transmit data into a 1-Gbit-per-second light pipe
spanning the United States, the 15,000-microsecond propagation
delay is such that the first bit will come out of the other end
only after 15 million bits have been pumped in!)
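The figure in the parenthetical example is simply the product of the data rate and the propagation delay. A minimal sketch, using integer arithmetic and a hypothetical function name:

```python
def bits_in_flight(bit_rate_bps: int, propagation_delay_us: int) -> int:
    """Bits pumped into the pipe before the first bit emerges at the far end."""
    return bit_rate_bps * propagation_delay_us // 1_000_000


if __name__ == "__main__":
    # 1 Gbit per second and a 15,000-microsecond propagation delay.
    print(bits_in_flight(1_000_000_000, 15_000))
```

Raising the data rate increases this number in proportion, while the delay itself is fixed by distance, which is why bandwidth alone cannot buy tight coupling.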

This planet is currently laced with many types of
computer/communications networks at all levels.

There are wide area networks, packet-switched networks,
circuit-switched networks, satellite networks, packet radio
networks, metropolitan area networks, local area networks, cellular
radio networks, and more; and they are mostly incompatible within
each type and across types. At the same time, the end user's
facility consists of telephones, data terminals, Host machines, PBX
switches, alarm systems, video systems, FAX machines, etc. The
incompatibility problem escalates!

What is needed in a distributed system is a standard digital
communication service to connect the many user devices with one
another across the room or across the world. Fortunately, there is
a worldwide movement to define and adopt an integrated solution to
this problem, which has given rise to the Integrated Services
Digital Network (ISDN). The ISDN service defines a customer
interface (a plug in the wall) to which the user's devices can
attach and gain access to the worldwide integrated digital network.
We are not likely to see much definition and penetration of ISDN
until the end of this decade and, possibly, into the next decade
(and most likely it will first appear at the local network level).

What all this should tell us is that we are approaching a time when
massive connectivity among devices and systems will exist. Such
connectivity is necessary if we are to derive the full benefits
from distributed systems.

At the processor technology level, perhaps the most dramatic
development is the gathering momentum in the proliferation of
personal workstations. They are spearheading the drive toward
distributed systems. At the other end of the spectrum, parallel
machine architectures are being proposed all over the world to
increase the processing capacity that can be applied to a single
problem. Both of these technologies are moving very rapidly and are
putting pressure on distributed-systems research and development.
We are seeing the development of massively distributed
architectures that can be configured as tightly coupled, loosely
coupled, or even hierarchically structured systems.

Massively  distributed  and  massively  connected  sys¬ 
tems  with  enormous  computational  capacity  are  likely 
to  appear  in  the  next  10  years.  Unless  we  pay  very 
careful  attention  to  the  user  interface,  users  will  be 
hopelessly  lost  and  ineffective.  At  the  very  least,  we 
must  provide  users  with  languages  that  allow  them  to 
take  advantage  of  the  distributed  architecture  and  to 
write  application  code  quickly  and  in  a  way  that  allows 
the  application  package  to  be  modified  and  maintained 
easily.  Moreover,  the  complexity  of  the  system  should 
be  transparent  to  users.  Users  need  to  interface  with  a 
systemwide  operating  system  that  offers  the  use  of  a 
single  logon  (with  a  networkwide  name  and  password) 
and that provides access to file servers, database servers,
automatic backup, processing servers, mail servers, application
packages, education and help functions, etc.

The system itself could take advantage of expert-systems capability
in providing these services to the user. And the system is likely
to include extensive redundancy in order to provide high levels of
reliability and fault tolerance. It should also be self-repairing,
and even self-organizing, as the conditions and demands on it
change.

Aside from the business-oriented applications and
developments  listed  above,  an  enormous  consumer- 
oriented  set  of  products  will  be  developed.  One  device 
that  spans  business  and  personal  needs  is  a  proper  "lap" 
computer  that  will  provide  the  user  with  remote  access 
to  the  massive  distributed  network  resources  described 
in  this  article. 

We  foresee  a  new  phenomenon  whereby  users  are 
confronted  with  so  many  attractive  features  in  new  de¬ 
vices  and  software  packages  that  they  cannot  possibly 
learn  to  use  them  all.  Learning  how  to  use  the  features 
represents  an  investment  far  beyond  users'  available 
time;  and  yet  the  features  are  wonderfully  seductive. 

To coin a term, I would like to refer to this phenomenon as
"FEATURE SHOCK"!

As we observe the growth of our man-made distributed systems, we
wonder how the ants, bees, birds, fish, and higher animals have
managed to perform so well with their distributed systems. If we
are ever to achieve a level of performance anywhere near theirs, we
will have to further uncover the underlying principles of
distributed-systems behavior (see sidebar). We have discussed some
of these in this article, but there is much new ground to be
broken. Almost anywhere you dig you are likely to find pay dirt.
The field is wide open for new ideas and new approaches,
challenging problems remain unsolved, and the application of new
results will be widespread and rapid. What lovelier environment
could you seek?

BROADCAST COMMUNICATIONS AND DISTRIBUTED ALGORITHMS


Rina  Dechter 
Leonard  Kleinrock 


Abstract 

The paper addresses ways in which one can use "broadcast communication" in distributed
algorithms and the relevant issues of design and complexity. We present an algorithm for merging
k sorted lists of elements using k processors and prove its worst case complexity to be 2n,
regardless of the number of processors, while neglecting the cost arising from possible conflicts
on the broadcast channel. We also show that this algorithm is optimal under single-channel
broadcast communication. In a variation of the algorithm we show that by using an extra local
memory of O(k) the number of broadcasts is reduced to n. When the algorithm is used for sorting
n elements with k processors, where each processor sorts its own list first and the lists are then
merged, it has a complexity of $O(\frac{n}{k} \log \frac{n}{k} + n)$ and is therefore asymptotically
optimal for large n. We also discuss the cost incurred by the channel access scheme and prove that
resolving conflicts whenever k processors are involved introduces a cost factor of at least log k.

1.  Introduction 

Consider the following algorithm for finding the maximum of a set of k distinct numerically
valued elements where each element is stored within a separate processor. When the algorithm
begins, each processor attempts to transmit its own value using a common broadcast channel
to which all processors listen. However, only one processor is enabled (permitted) to transmit
by means of some access scheme (conflict resolution scheme). Each processor compares its value
with the largest value transmitted so far. All processors that have a larger value try again to
broadcast their own values, etc. The algorithm terminates when all processors have either
transmitted their values or have "given up", which is detected by silence on the channel. The last
element to be broadcast is the maximum.
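
The procedure just described can be traced with a short simulation. The following Python sketch is only an illustration under stated assumptions: the function name `max_by_broadcast` and the `choose` parameter are hypothetical, and `choose` stands in for whatever access scheme enables one of the contending processors (by default the smallest processor id).

```python
def max_by_broadcast(values, choose=min):
    """Simulate the Max-Algorithm on an idealized single broadcast channel.

    values[i] is the distinct value held by processor i.
    choose picks which contending processor id gets the channel
    (hypothetical stand-in for the access scheme; default: smallest id).
    Returns (maximum, number_of_broadcasts).
    """
    if not values:
        raise ValueError("need at least one processor")
    largest_heard = None
    broadcasts = 0
    while True:
        # Processors whose value beats everything heard so far still contend.
        contenders = [i for i, v in enumerate(values)
                      if largest_heard is None or v > largest_heard]
        if not contenders:
            # Silence on the channel: every processor has sent or given up.
            return largest_heard, broadcasts
        winner = choose(contenders)   # exactly one sender is enabled
        largest_heard = values[winner]
        broadcasts += 1
```

With the default rule, values held in increasing order of processor id force every processor to broadcast, while values held in decreasing order need a single broadcast.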

This admittedly simple algorithm (referred to as the "Max-Algorithm" and also presented
in [16]) demonstrates how a distributed algorithm can utilize broadcast communication. The
term broadcast implies the existence of a single channel on which only one node (processor) can
transmit at one time while all the others receive the message simultaneously.

"Algorithms by broadcasting" have not received much attention in the literature on parallel
and distributed algorithms. An earlier report by the authors [8] was among the first discussions
of such algorithms and those results constitute a portion of the current paper. Other contributions,
appearing at around the same time, are [15, 16]. In this introduction we survey the motivation for
using broadcasting as a model for distributed computation, point out its unique features, summarize
relevant work and point out our contribution.

In the area of parallel algorithms, the closest thing to broadcasting is the assumption of
the existence of a global, or a shared, memory from which all the processors can simultaneously
read the same value [5, 20]. However, shared memory models usually do not place any limit on
the number of memory cells which are used by the processors. In the context of broadcasting this
would mean that there is more than one broadcast channel and that each processor can use any
channel according to the requirements of the algorithm. In most broadcast-based networks (e.g.
local area networks [18]) there is only one channel shared by all processors. Therefore, most of
the results for shared memory models are not applicable to broadcasting. An example in which
the results are applicable is the search algorithm presented by Snir [20] using the CREW (con-


A major difficulty in using broadcast communication is the issue of access to the channel.
Many access schemes have been proposed and analyzed [22]. The focus of most papers is on how
to increase channel capacity and on the tradeoff between throughput and delay. Clearly, the
access scheme may have a significant impact on the complexity of the algorithm. We will
approach this problem through two models. In both models, processors broadcast one value at a
time on a channel shared by all of them. Only one message will be posted at a slot on the channel,
and all processors can read the posted message. In our first model, named IPABM (Ideal Parallel
Broadcast Model), we assume that some "ideal" access scheme exists, i.e., if several processors
demand the use of the channel at the same time, there is a global mechanism which enables one of
them to transmit in a constant time. Later it is refined into a "more realistic" model, RPABM
(Realistic Parallel Broadcast Model), that incorporates a conflict resolution protocol (CRP), and
we discuss two specific access schemes and their influence on the time complexity of the
algorithms.


The vehicle we use to study algorithms by broadcasting is the class of "comparison based"
algorithms (sorting, searching, etc.). Parallel versions of these algorithms have been extensively
studied under various models of communications [2, 3, 5, 7, 19, 20, 21, 23, 24] and therefore they are
well suited for studying the power and limitations of broadcast communication. For a review of
sorting algorithms see also [9].

Relevant work in the area of broadcast algorithms includes the work by Levitan [15], who
uses a BPM (Broadcast Protocol Multiprocessor) model which is identical to our IPABM, and
obtains results similar to those given in the current paper. In particular, he presents a sorting
algorithm which is identical to our second version of the merge algorithm when all processors have
just one element. In addition, he gives an algorithm for finding a minimum spanning tree in a
graph. Algorithms for finding the extrema in a broadcast model are presented in [16] and [4]. The
latter uses a mixed model that, in addition to the conventional links, also allows a global bus for
broadcast communication.


The main contributions of this paper are: an efficient algorithm for merging, proving its
optimality and dealing in a formal way with issues that emerge from the use of broadcasting as a
model for distributed computing. In particular, in the analysis we take into account both the
communication cost (time to broadcast a message) and the computation cost (time for performing a
comparison). We also discuss the overhead introduced by different access schemes.

In the next section we present our model, discuss complexity issues and analyze the
performance of the Max-algorithm discussed above. In Section 3 we present an algorithm for
merging k sorted lists of n/k elements each. We show that the worst case performance of the
algorithm is independent of the number of processors (which is also the number of lists), and is
bounded from above by 2n broadcasts and the same number of comparison stages. A comparison stage
is one time slot in which several processors in parallel perform one comparison. In a variation of
this algorithm we show (Section 3.3) that an additional O(k) storage in each processor can reduce
the number of messages broadcast to n. Using the merge algorithm for sorting n elements with
k processors (Section 3.4) yields a worst case time complexity of
$O\left(\frac{n}{k}\log\frac{n}{k} + n\right)$. Thus for large n,
the Merge-sort algorithm achieves an asymptotic speed-up ratio of k with respect to the best
sequential (i.e. single processor) algorithm, whose complexity is $O(n \log n)$. In Section 4 we show
the optimality of the Merge algorithm by proving that any Merge-by-broadcast algorithm requires
n broadcasts. Section 5 addresses access scheme issues.

2. The IPABM model and complexity measures

The model which is used throughout most of the paper is presented next. Let us define an
IPABM as a collection of processors which compute in parallel synchronously and which
communicate via a single broadcast channel. The channel is slotted into time slots of size T (where T
is the time for a message transmission). At each step each processor can read the message in the
current slot on the channel, do some computation and submit a message to be broadcast in the
next time slot. Any number of processors can read the current message on the channel but only
one message among those submitted for transmission will be chosen by the global access mechanism
to be broadcast in the next time slot. An empty slot indicates that no processor wants to talk.

The complexity of an algorithm will be measured by its computing time and its
communication time. Dealing with comparison based algorithms, we consider a comparison operation as
the basic computation step and the broadcast of a message as the basic communication step. Thus
the "number of comparisons" (#comparisons) performed in parallel and the "number of
broadcasts" (#broadcasts) characterize the computation time and broadcast time, respectively. Let t be
the time for a comparison operation (since the algorithms we consider are synchronized, the
analysis does not take into account variations in computing time among processors). The above
two measures are combined as follows:

$$T(A) = (\#\text{comparisons}) \cdot t + (\#\text{broadcasts}) \cdot T \qquad (1)$$

where $T(A)$ stands for the worst case time complexity of algorithm A. By $\bar{T}(A)$ we denote the
average time complexity of algorithm A over all problem instances.

In the Max-Algorithm each broadcast is followed by one comparison operation performed
in parallel by some of the processors. Therefore it is sufficient to account only for #broadcasts as
the measure of complexity.

Let $T(Max(k))$ be the number of broadcasts performed by the Max algorithm with k
elements and k processors. On some input instances $Max(k)$ will require each element to be
broadcast and thus

$$T(Max(k)) = k. \qquad (2)$$
The average number of broadcasts, $\bar{T}(Max(k))$, obeys the following recurrence:

$$\bar{T}(Max(k)) = 1 + \sum_{i=1}^{k} \frac{1}{k}\,\bar{T}(Max(k-i)) \qquad (3)$$

This is true since if the first element to be broadcast is the i-th smallest element, then exactly
i-1 among the remaining k-1 processors will remain silent and will not participate in the rest of
the algorithm. We assume a homogeneous distribution of the elements among the processors, and
thus the above event has probability $\frac{1}{k}$ and the recurrence follows. (3) can be written as:

$$\bar{T}(Max(k)) = 1 + \frac{1}{k}\sum_{i=0}^{k-1} \bar{T}(Max(i)) \qquad (4)$$

with $\bar{T}(Max(0)) = 0$, $\bar{T}(Max(1)) = 1$. Solving this recurrence yields

$$\bar{T}(Max(k)) = \sum_{i=1}^{k} \frac{1}{i} \cong \log k \qquad (5)$$

The same results were obtained in [16] using a slightly different analysis.
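
As a sanity check on the average-case analysis, the recurrence can be evaluated exactly and compared with the harmonic sum. The sketch below assumes the recurrence is read as "one broadcast plus the average over the k equally likely sizes 0, ..., k-1 of the remaining problem", starting from an average of 0 for zero processors; the function names are hypothetical and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction


def average_broadcasts(k_max):
    """Return [avg(0), avg(1), ..., avg(k_max)] from the recurrence
    avg(k) = 1 + (1/k) * (avg(0) + ... + avg(k-1)), with avg(0) = 0."""
    avg = [Fraction(0)]
    for k in range(1, k_max + 1):
        # At this point avg holds exactly the entries avg(0) .. avg(k-1).
        avg.append(1 + sum(avg, Fraction(0)) / k)
    return avg


def harmonic(k):
    """k-th harmonic number 1 + 1/2 + ... + 1/k as an exact fraction."""
    return sum(Fraction(1, i) for i in range(1, k + 1))
```

Under this reading the two quantities agree for every k, which is consistent with the logarithmic growth of the average number of broadcasts.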

3.  Parallel  Merge  by  Broadcast 
3.1  Description  of  the  algorithm 

We present a distributed algorithm that merges k sorted lists of n/k distinct elements into a
decreasing series, using an IPABM with k+1 processors. Each of the first k processors contains
one of the lists and each has an identity (id) and a local memory. The size of local memory is
fixed and not dependent on k or n. For simplicity we designate the (k+1)-th processor to be the
"output processor" (the one in which the output will be stored). All processors cooperatively
participate in the task of merging the sorted lists they possess. The maximum element in each
processor's list is called its "current value".

The algorithm can be decomposed into cycles. In each cycle the maximum of the current
values is determined. This element is broadcast to the output processor as the next element in the
merged list, and is removed from the processor to which it belonged (the processor updates its
current value). Each cycle is implemented by the Max-Algorithm presented earlier. The
processor that broadcasts first is the initiator of the cycle; the last is the terminator of the cycle. During
the cycle processors try to broadcast their current values as long as they have not yet heard a larger
value being broadcast. In order to eliminate redundant broadcasts there will be some dependency
between the initiations of cycles.

When a processor succeeds in broadcasting its current value, it denotes the value which is
broadcast immediately afterwards as its successor. The successor value is updated each time the
current value is rebroadcast. When the current value is the terminator of the cycle, it has no
successors. The current value is the predecessor of its successor. Each current value will have at most
one successor at any given time (it may have none, if it was not broadcast yet). It will also have at
most one predecessor. In terms of this terminology, the rule for cycle initiation is as follows: a
processor initiates the next cycle (by rebroadcasting its current value) if the present cycle was
terminated by a successor to its own current value. If there is no predecessor the next cycle can be
initiated by any processor.

In order to implement the above algorithm in a distributed fashion, processors must be
able to detect the end of a cycle and its terminator, and to determine whether or not they should
initiate the next cycle. The end of a cycle is determined by silence (i.e. an empty time slot). The
initiator of a cycle is determined by the successor-predecessor relationship, as described earlier.
Two empty slots in succession indicate that a cycle is terminated but that a specific initiator does
not exist, in which case all processors try to initiate the next cycle.

The algorithm for each processor is described in Figure 1. While listening to the channel a
processor can recognize one of the following three cases: a value is broadcast in the current slot
(case 1), or the current slot is empty but the previous one holds a value (case 2), or two
consecutive empty slots have occurred (case 3). In cases 2 and 3 a cycle has terminated and an initiator must be
determined. We assume that each processor has a procedure for determining the terminator of a
cycle which is used in case 2. The procedure process_update is used each time the processor has
successfully broadcast its current value, CV. It determines whether the value is also the
terminator of the cycle, or whether the successor value, SUCC, should be updated. If it is the terminator,
the value at the top of its list is removed and the value of CV is updated to be the new maximum
in its list. The first broadcast of each current value may terminate the cycle in which it
participated. If it did not, this value will be rebroadcast only for initiating future cycles until it
terminates one. Then, the current value is updated and the processor tries to broadcast the new
current value for the first time. A formal proof of this behavior is given later.

It is convenient to trace the execution of the algorithm using a global work stack in which
the values being broadcast are recorded. The work stack could also be kept in each of the
processors and thus be used to control the algorithm (see Section 3.3). Here it is utilized only to explain
the rule for cycle initiation.

Consider the following example. Suppose we have n = 8, k = 4 and the initial situation is as
follows:


The execution of the algorithm is traced in Figure 2. For each cycle we give the sequence
of (processor, element) pairs in that cycle, the element determined to be the next in the merged list
(i.e. the output list), and the contents of the working stack. The algorithm is initiated by processor
P and we assume that write access is given to the processor with the smallest id among those
that want to talk. The termination of a cycle is detected by an empty time slot. Broadcast
elements are pushed into the work stack as they are heard. The Max element is popped from the
work stack and joins the output list. The next cycle is then initiated by the top element.


begin
    /* initialization */
    CV <- the maximum value in list
    SUCC <- nil
    TERM <- nil
    BV <- nil

    repeat
        BV <- read next msg

        1. if BV is not empty then
               if CV > BV then broadcast CV
               if successful then call process_update

        2. if BV = empty then    /* a cycle is terminated */
           begin
               TERM <- terminator of the cycle
               if TERM = SUCC then
                   broadcast CV & call process_update
           end

        3. if heard two empty slots then try to broadcast CV
               if successful then call process_update
    until list is empty
end

process_update
begin
    BV <- read next msg
    if BV = empty then
    begin
        remove CV from list
        if list is empty CV <- NIL else
            CV <- next element in list
    end
    else
        SUCC <- BV
end

Figure  1:  The  merge  algorithm 


In  the  first  cycle,  34  is  determined  to  be  the  Max  element.  It  is  removed  from  the  list  of 
processor  P ,,  the  highest  element  of  which  then  becomes  75.  The  second  cycle  is  initiated  by  pro¬ 
cessor  Pj  since  It  broadcast  immediately  before  Pj  in  cycle  1  and  so  on.  The  mechanism  of  a 
working stack suggests that rebroadcasting the element that initiates a cycle by a processor could
be avoided altogether since the processors already heard that value and they can memorize it.
Figure 2: The execution of the Merge algorithm on an example problem


Indeed,  this  is  the  basis  of  the  improved  Merge  algorithm  to  be  described  later. 
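
The cycle structure described in this section can be explored with a small global simulation. The sketch below is an illustration only, under several assumptions: it replaces the per-processor SUCC and TERM variables with one shared work stack (following the work-stack description above), it gives channel access to the smallest contending processor id, it assumes all values are distinct, and the name `merge_by_broadcast` is hypothetical. Empty slots are not counted as broadcasts.

```python
def merge_by_broadcast(lists):
    """Global simulation of the first Merge-by-broadcast version.

    lists: k lists of distinct values, one per processor.
    Returns (merged list in decreasing order, number of broadcasts).
    """
    lists = [sorted(lst, reverse=True) for lst in lists]
    head = [0] * len(lists)        # head[p] indexes processor p's current value
    n = sum(len(lst) for lst in lists)
    stack = []                     # (value, processor) heard but not yet merged
    merged, broadcasts = [], 0

    def current(p):
        return lists[p][head[p]] if head[p] < len(lists[p]) else None

    while len(merged) < n:
        if stack:
            # The predecessor of the previous terminator (top of the stack)
            # initiates the cycle by rebroadcasting its current value.
            last = stack[-1][0]
            broadcasts += 1
        else:
            last = None            # two empty slots: any processor may initiate
        while True:
            contenders = [p for p in range(len(lists))
                          if current(p) is not None
                          and (last is None or current(p) > last)]
            if not contenders:     # empty slot: the cycle has terminated
                break
            p = min(contenders)    # assumed access rule: smallest id wins
            last = current(p)
            stack.append((last, p))
            broadcasts += 1
        value, p = stack.pop()     # terminator = maximum of the current values
        merged.append(value)
        head[p] += 1               # its processor moves to the next element
    return merged, broadcasts
```

For instance, calling it on the four lists [8, 3], [7, 2], [6, 1], [5, 4] (made-up data, not the example of Figure 2) merges the eight values with 12 broadcasts, of which four are rebroadcasts of the value 3 that initiate later cycles.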

3.2 Correctness and complexity analysis

The correctness of the algorithm follows immediately from the following three facts:

1. The first element in each list is the largest in that list at all times.

2.  In  each  cycle  the  maximum  of  all  the  first  elements  is  determined. 

3.  The  determined  maximum  is  removed  from  its  list  and  added  to  the  output  list. 

In  the  complexity  analysis  we  calculate  only  the  number  of  broadcasts  performed,  since 
each  broadcast  is  followed  by  a  comparison  stage.  We  show  that  the  worst  case  complexity  is 
2n-1. In order to prove that, we consider the following two lemmas.

Lemma  1: 

Let $CV_i$ denote the current value of processor $P_i$. Whenever $CV_i$ is rebroadcast after the first time,
it initiates a cycle.

Proof:

Assume to the contrary that the claim is not correct. There is, therefore, a situation in which a
processor that had already broadcast its current value later hears a smaller value on the channel.
The processor, in response, will rebroadcast its current value. Also, any current value that was
broadcast must have a successor value, and that successor is larger than itself. Consider the first
cycle in which this situation occurs and let $CV_i$ be the largest among the already-heard current
values at the time of this cycle that hear a smaller value on the channel (case 1 in the algorithm).
Let $CV_j$ be the successor of $CV_i$ at that time (any current value that was broadcast must have a
successor). This successor is larger than $CV_i$ and therefore it is also larger than the value on the
channel. However, since we picked $CV_i$ to be the largest one with this property we get a
contradiction.


□ 

It follows that from the time an element is first broadcast until it is merged, only values
greater than or equal to itself can be broadcast. Let $\#V_i$ be the number of times element $V_i$ is
broadcast. The lemma implies that if $\#V_i > 1$ then $V_i$ initiated at least $\#V_i - 1$ cycles. From Lemma 1
we can conclude also that there is at most one predecessor to each current value. The reason is that
a value is determined as a successor only during its first transmission (it cannot be the initiator of
that cycle). Therefore it will be a successor to only one current value.

Lemma  2: 

Each time $V_i$ initiates a cycle (except for the last cycle, which includes only $V_i$), the cycle is
terminated by an element that has never been broadcast before.

Proof:

Assume $\#V_i > 2$. We arbitrarily choose the cycle which is initiated by the m-th broadcast of $V_i$
($1 < m < \#V_i$). Let the cycle be terminated by an element denoted $V_j$. If $V_j$ participated in an
earlier cycle, it must initiate all other cycles in which it participates (according to Lemma 1), which
leads to a contradiction.

□ 

This lemma implies that with each broadcast of $V_i$, excluding the first and the last, we can
associate a distinct element which is broadcast just once. This element is removed immediately
after it is broadcast. Since no two distinct elements initiate the same cycle, we can partition the
set $\{V_1, \ldots, V_n\}$ of all elements into disjoint subsets $M = \{S_1, S_2, \ldots\}$ such that each subset $S_i$
either consists of a single element, which is broadcast once or twice, or $S_i$ consists of m elements,
one of which, $V_i$, is broadcast m+1 times and the other m-1 elements are those which terminate
each of the m-1 cycles initiated by $V_i$. The subsets satisfy:

1. $S_i \cap S_j = \emptyset$ for $i \neq j$

2. if $|S_i| = m$ then the total number of broadcasts, $B(S_i)$, by elements in $S_i$ satisfies

$$B(S_i) \le 2m$$

This  leads  to  the  following  theorem. 

Theorem  1: 

Let $T(n,k)$ be the number of broadcasts performed by the Merge algorithm. $T(n,k)$ satisfies

$$n \le T(n,k) \le 2n-1 \qquad (6)$$

Proof:

It is obvious that $n \le T(n,k)$ since each element has to be broadcast at least once. Also

$$T(n,k) \le \sum_{S_i \in M} B(S_i) \le \sum_{S_i \in M} 2|S_i| = 2n \qquad (7)$$

Since the element which terminates the first cycle is broadcast just once, we are left with n-1
elements for which we have shown in (7) that the upper bound for their total broadcast time is
2(n-1), which yields:

$$T(n,k) \le 2(n-1) + 1 = 2n-1$$

□ 

3.3 An Improved Merge Algorithm

As mentioned earlier, it is possible to decrease the number of broadcasts required by
making the initiation and termination of a cycle more sophisticated. However, these savings require a
larger local memory for each processor.

In the modified version, each processor stores all the elements that were broadcast in a
stack called wstack (as we did in the example). The initiation of cycles by elements that were
broadcast before will be avoided altogether, since this information exists in the wstack of every
processor. Each cycle now begins by broadcasting the second element relative to this cycle in the
original algorithm, and cycles with one element will now reduce to empty cycles.

At the beginning of a cycle each processor compares the value at the head of its list
with the top element in the wstack and decides to broadcast only if its value is larger. The rest of
the cycle proceeds in the same manner as in the previous algorithm, where each processor pushes
values onto wstack as it hears them. At the end of a cycle (determined by an empty slot) each
processor pops the top element from wstack. Any consecutive empty slot following the first
corresponds to an empty cycle (a cycle of 1 element in the previous algorithm). For each such
empty slot a processor pops its wstack and this value joins the merged list. When wstack is empty
but its list is not, the processor knows that the initiation of a cycle by a value which was never
broadcast before is called for.

The number of broadcasts required by this algorithm is exactly n. Each element is
broadcast just once. The number of comparison stages, however, remains the same as before. Note that
the new algorithm requires that each processor maintain a stack of size O(k), thus presenting a
tradeoff between the number of broadcasts and the size of local memory.
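
The improved version can be sketched in the same simulated setting. The code below is an illustration under assumptions rather than the authors' implementation: one shared `wstack` stands in for the identical copies kept by every processor, channel access goes to the smallest contending id, values are assumed distinct, and the function name `improved_merge` is hypothetical. It also records the deepest stack level reached, to illustrate the O(k) memory requirement.

```python
def improved_merge(lists):
    """Simulate the improved merge: no value is ever rebroadcast.

    Returns (merged list in decreasing order, broadcasts, deepest wstack size).
    """
    lists = [sorted(lst, reverse=True) for lst in lists]
    head = [0] * len(lists)
    n = sum(len(lst) for lst in lists)
    wstack = []                    # identical copy assumed in every processor
    merged, broadcasts, deepest = [], 0, 0

    def current(p):
        return lists[p][head[p]] if head[p] < len(lists[p]) else None

    while len(merged) < n:
        # The cycle starts from the remembered top of wstack (if any)
        # instead of from a rebroadcast of that value.
        top = wstack[-1][0] if wstack else None
        while True:
            contenders = [p for p in range(len(lists))
                          if current(p) is not None
                          and (top is None or current(p) > top)]
            if not contenders:     # empty slot ends the (possibly empty) cycle
                break
            p = min(contenders)    # assumed access rule: smallest id wins
            top = current(p)
            wstack.append((top, p))
            broadcasts += 1
            deepest = max(deepest, len(wstack))
        value, p = wstack.pop()    # popped value joins the merged list
        merged.append(value)
        head[p] += 1
    return merged, broadcasts, deepest
```

In this sketch every element is pushed exactly once, so the broadcast count equals n, and the stack never holds more than one pending value per processor.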

In the rest of this paper, whenever we talk about the "Merge by broadcast" algorithm we
mean the first version unless otherwise specified.

3.4  Sorting  algorithms 

The Merge by broadcast algorithms can be used to sort n elements with k processors with
IPABM by initially having each processor sort its own list, using some efficient sequential
algorithm (such as quicksort or sequential Merge sort [1]). The merging phase is performed by our
Merge-by-broadcast algorithm. Let us call this sorting algorithm Broad-Sort(n,k).

Theorem 3:

The complexity of Broad-Sort(n,k) is given by:

$$T(\text{Broad-Sort}(n,k)) \le \left(\frac{n}{k}\log\frac{n}{k} + 2n-1\right) t + (2n-1)\,T \qquad (8)$$

Proof:

Here we use the combined measure of performance that accounts for both comparison time and
communication time.

The theorem is proved as follows:

$\frac{n}{k}\log\frac{n}{k}$ is the number of comparisons required to sort each list,

$2n-1$ is the number of comparisons for merging,

$2n-1$ is the number of broadcasts.

□ 
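
To get a feeling for the bound in Theorem 3, it can be evaluated numerically. The sketch below is only illustrative: it assumes base-2 logarithms and ignores constant factors, and the function names `broad_sort_bound` and `speedup_bound` are hypothetical. The second function compares the bound with a sequential cost of n log n comparisons.

```python
import math


def broad_sort_bound(n, k, t=1.0, T=1.0):
    """Upper bound on the Broad-Sort time for n elements and k processors.

    t is the time of one comparison, T the time of one broadcast slot.
    Base-2 logarithms are an assumption made for this illustration.
    """
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    per_list = n / k
    local_sort = per_list * math.log2(per_list)   # sorting one local list
    comparisons = local_sort + (2 * n - 1)        # plus the merging stages
    broadcasts = 2 * n - 1
    return comparisons * t + broadcasts * T


def speedup_bound(n, k, t=1.0, T=1.0):
    """Ratio of the sequential n*log2(n) comparison cost to the bound above."""
    return (n * math.log2(n) * t) / broad_sort_bound(n, k, t, T)
```

For example, with n = 1024, k = 4, t = 1 and communication time neglected (T = 0), the bound evaluates to 256 * 8 + 2047 = 4095 time units.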

The maximum number of comparisons required for sorting a sequence of n elements on a
sequential processor is asymptotically n log n. Therefore, when k is smaller than log n the
asymptotic speedup ratio of the optimal sequential algorithm over the above algorithm is k, which is
optimal. However, when k is greater than log n, the total execution time required is
asymptotically linear in n. Formally the speed-up ratio between sequential sorting and this Merge-Sort
algorithm is greater than


When the improved Merge algorithm is used to sort n elements with n processors (k = n)
then we get the algorithm described by Levitan [15]. This algorithm uses exactly n broadcasts
and 2n comparison stages. In this case (when k = n) the number of comparison stages can be
reduced to n by having the elements which were not transmitted yet keep a pointer to their
relative order in the wstack. Each processor can do that with no extra cost since they listen to the
channel anyway. In that case, when a new cycle begins with the top element in the wstack, the
elements know their relation to it and they don't need to make the comparison.
This argument cannot be extended to the Merge-algorithm since the elements that are
updated to be the new current values were not compared with the values that were already
broadcast.

Another way to sort n elements with n processors is as follows. Each processor
broadcasts its value in a prespecified order. Each processor listens to the channel and remembers the
value of its immediate successor in the list. After n broadcasts, all processors are linked in the
order of their values. In the next phase the processors will broadcast their values again according
to the order dictated by the linked list. The first will be the max value (the one with no
successors). This algorithm uses 2n broadcasts and n comparison stages. Its advantage over Levitan's
algorithm is that there is no contention on the channel, and therefore in the RPABM model (to be
discussed later) no extra cost will have to be paid for accessing the channel. This algorithm
cannot be extended to a Merge-algorithm (at least not in a straightforward way) without increasing the
number of comparison stages significantly.
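
The two-phase scheme just described can be sketched as follows. This is a simplified illustration under assumptions: values are distinct, the "prespecified order" of the first phase is taken to be the processor index order, and the function name `linked_list_sort` is hypothetical. Each processor keeps only the smallest value it has heard that exceeds its own.

```python
def linked_list_sort(values):
    """Sort n distinct values held by n processors using 2n broadcasts.

    values[p] is the value of processor p. Returns (decreasing list, broadcasts).
    """
    n = len(values)
    succ = [None] * n              # succ[p]: smallest heard value above values[p]
    broadcasts = 0

    # Phase 1: everybody broadcasts once in a fixed order, so no contention.
    for heard in values:
        broadcasts += 1
        for p in range(n):
            if heard > values[p] and (succ[p] is None or heard < succ[p]):
                succ[p] = heard

    # Phase 2: the processor with no successor (the maximum) speaks first;
    # afterwards a processor speaks right after hearing its own successor.
    output = []
    heard = None
    for _ in range(n):
        speaker = next(p for p in range(n) if succ[p] == heard)
        heard = values[speaker]
        output.append(heard)
        broadcasts += 1
    return output, broadcasts
```

Because the speaking order in both phases is fully determined in advance or by the last value heard, no two processors ever compete for the same slot in this sketch.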

4. Optimality of the Merge-Algorithm

In this section we establish a lower bound on the performance of all "merge by broadcast"
algorithms when the only criterion is the number of broadcasts performed. Consequently, the
above Merge algorithms are shown to be optimal since they meet this bound.

We consider all possible Merge algorithms using the IPABM to merge n distinct
elements drawn from a set S on which an order is defined. The n elements are grouped into k sorted
lists, each with n/k elements. There are k processors, each containing one of the sorted lists. The
output is obtained in an independent processor (the output processor). We claim that any
algorithm that merges the lists requires at least n broadcasts. More specifically, it requires that each of
the elements be broadcast at least once.


This claim might seem trivial since for the output processor to create the merged list it
must hear all the values! However, if we reformulate the requirement such that the output
processor need not know the actual values but simply their order, the claim is less obvious. Formally, let
$V^i_j$ be the j-th element in processor $P_i$ (or the j-th element in the i-th sorted list) and $v^i_j$ be the value
of the element in this location for a given input of n elements. The output processor should be
able to give, for each input, a series of locations $V^{i_1}_{j_1}, \ldots, V^{i_n}_{j_n}$ such that the
sequence of values $v^{i_1}_{j_1}, \ldots, v^{i_n}_{j_n}$ is the final merged list. From what we will show it
follows that the values themselves are also available at the output processor. First we prove our
claim for the special case of n lists having 1 element each.

Lemma  3: 

To merge n lists of 1 element each, with IPABM using n processors, each of the elements must be
broadcast.

Proof (sketch):

Any two elements which are adjacent in the sorted list must be compared directly. Let $a_i, a_{i+1}$ be
two consecutive elements. The order between these two elements and the rest is exactly the same;
thus, to determine their internal order they must be compared directly. Since $a_i$ and $a_{i+1}$ are
located in different processors and the outcome of the comparison must be available at the output
processor, which doesn't know them initially, both values must be broadcast. Each one of the
processors, including the output processor, can then compare and determine the order between the two
elements. It might be argued that it is sufficient to have one processor broadcast its value and the
other only indicate whether it is larger or smaller; however, we count all messages in the same
way, and since the broadcast of the value gives more information we assume the values themselves
are transmitted. We can conclude that since there are n-1 adjacent pairs, n-1 comparisons are
required, each corresponding to two broadcasts. Since n-2 of the elements participate in two
adjacent pairs, the number of broadcasts required is at least

$$2(n-1) - (n-2) = n \qquad (9)$$

□ 


In the general case, each processor has $\frac{n}{k}$ sorted elements. As argued earlier, any adjacent
pair of elements must be compared. If two adjacent elements are in the same processor it is not
necessary to broadcast both elements in order to make a comparison (a processor can make the
comparison and then just broadcast the result in some coding or broadcast only the larger value).
However, it is possible to create an input for which no two adjacent elements are located in the
same processor. Figure 3 illustrates such a case. Let $a_1, a_2, \ldots, a_n$ be the sorted list. The list of
elements in processor $P_i$ ($1 \le i \le k$) is $a_i, a_{i+k}, a_{i+2k}, \ldots, a_{i+n-k}$.



Figure  3:  A  worst  case  example 

Thus, any comparison between adjacent values requires that both of them be broadcast. This
yields the following theorem.

Theorem  2: 

Any Merge-by-broadcast algorithm, using IPABM, with $n$ elements and $k$ lists ($k$ processors)
requires $n$ broadcasts in the worst case if the output is accumulated in a separate output processor.

□ 
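
The worst-case input of Figure 3 can be checked mechanically. The following Python sketch is illustrative only: the function names are ours, and it assumes that $k$ divides $n$ and that the sorted values are simply 1, ..., $n$ dealt to the processors in round-robin order. It confirms that no two neighbours in the merged order share a processor.

```python
def worst_case_lists(n, k):
    """Deal the sorted values 1..n to k processors in round-robin order."""
    if n % k != 0:
        raise ValueError("this sketch assumes k divides n")
    # Processor i (1-based) receives i, i + k, i + 2k, ...
    return [list(range(i, n + 1, k)) for i in range(1, k + 1)]


def adjacent_pairs_split(lists):
    """True if every two neighbours of the merged order sit in different processors."""
    owner = {value: p for p, values in enumerate(lists) for value in values}
    merged = sorted(owner)
    return all(owner[a] != owner[b] for a, b in zip(merged, merged[1:]))


lists = worst_case_lists(12, 3)
print(lists)
print(adjacent_pairs_split(lists))
```

With such an input every one of the $n-1$ adjacent comparisons involves two different processors, which is the situation the lower bound relies on.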

It is easy to show that when the output processor is a processor which contains one of the
lists the lower bound decreases to $n - \frac{n}{k}$ broadcasts (the output processor need not broadcast its
own values).




From Theorems 1 and 2 we conclude that the Merge-by-broadcast algorithm presented in
Section 3 is asymptotically optimal w.r.t. the number of broadcasts, while its modified version is
absolutely optimal.

5.  Access  Scheme  Considerations 

So far in our discussion we have assumed the existence of an "ideal" access scheme that
resolves all conflicts in constant time. Even if such an access scheme is not available, this
assumption is appropriate when the algorithm itself is designed so that conflicts never arise (i.e.,
in each time slot at most one processor is enabled), as in the last sorting algorithm in Section 3.4.
If the algorithm is not designed in this way, conflicts between processors will generally arise and
the access scheme used may have a profound effect on the complexity of the algorithm.

Numerous access schemes have been proposed and analyzed [6, 10, 12, 13, 11, 22] in the
context of broadcast communications. The measures that are used for evaluating their
performance are those of channel capacity, throughput and delay. Of most interest to us is capacity; in
particular, we are interested in the ratio between the time the channel is used for conflict
resolution and the time it is used for "useful" communication.

We now modify our ideal model to include these access considerations. Let RPABM be
an IPABM with the following changes: the channel is slotted into two types of time slots, one of
length $T$ for message transmission and the other of length $\tau$. The RPABM has a Conflict
Resolution Protocol, CRP, which substitutes for the global access mechanism in IPABM. The
CRP is invoked whenever a conflict arises and it uses a $\tau$-slotted channel. A $T$ slot carrying a
transmission marks the termination of the CRP. Usually $\tau \ll T$.

The communication time of an algorithm with RPABM now includes the number of $T$
slots required for real transmissions and the number of $\tau$ slots used by the CRP. Let $\#T$ and $\#\tau$ be the
maximum number of $T$ slots and the maximum number of $\tau$ slots, respectively, used by algorithm
$A$ for broadcasts and for resolving conflicts given a specific CRP. The worst case communication
time for algorithm $A$ with CRP, denoted by Comm($A$,CRP), is defined as:

$$\mathrm{Comm}(A,\mathrm{CRP}) = \#T \cdot T + \#\tau \cdot \tau \tag{10}$$

Since $\tau \ll T$, the contribution of the second term (the cost of accessing) might be negligible for
some specific parameters of the problem. However, for the asymptotic complexity, $\tau$ cannot be
ignored. The ratio $\frac{\#\tau}{\#T}$ characterizes the impact of the access scheme used on the asymptotic
complexity of the algorithm. Specifically, if $\frac{\#\tau}{\#T} = O(f(k))$, where $f(k)$ is a nondecreasing function of
$k$ (the number of processors), then it is easy to see that

$$\mathrm{Comm}(A,\mathrm{CRP}) = O(\#T\,(f(k)+1)) \tag{11}$$
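
The step from (10) to (11) can be spelled out as follows. This is a sketch under our own assumptions: $T$ and $\tau$ are treated as constants, and $c$ denotes the (unnamed) constant hidden in the bound $\#\tau \le c\, f(k)\, \#T$.

```latex
\begin{align*}
\mathrm{Comm}(A,\mathrm{CRP}) &= \#T \cdot T + \#\tau \cdot \tau \\
  &\le \#T \cdot T + c\, f(k)\, \#T \cdot \tau \\
  &= \#T \,\bigl(T + c\,\tau f(k)\bigr) \\
  &= O\bigl(\#T\,(f(k)+1)\bigr).
\end{align*}
```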

In the next two subsections we present the "Merge by broadcast" algorithm using two
realistic access schemes: MSAP (mini-slot alternating priority) [11] and the Tree-Algorithm [6].
Following that we give some lower bounds for CRP performance.

5.1 Merge with MSAP

In the MSAP scheme, processors obtain permission to broadcast according to some
predetermined priority order among them dictated, for example, by their id’s. Processors may
broadcast one after the other in a round robin fashion, and when a processor has nothing to say its
turn is passed to the next in order after an empty $\tau$-slot. When it does broadcast it uses a $T$ slot,
and then gives the turn to the next processor. This is similar to a token ring.

This method is very appealing for the Merge-algorithm since we have cycles built into it
in which each processor might want to transmit once. Thus, integrating the access scheme into
the Merge-algorithm is straightforward: processors are ordered in increasing order of their id’s.
Any cycle of the algorithm is initiated by its initiator if the algorithm determines one. Otherwise
$P_1$ is the initiator. If $P_i$ is an initiator of a cycle then the next one that can talk is $P_{i+1}$, and then
$P_{i+2}$ and so on until the end of the cycle. A processor determines when its turn comes by detecting
the end of a cycle, by knowing the initiator’s id and by counting the number of $\tau$ and $T$ slots that
occurred.
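
One cycle under MSAP can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: the names `msap_cycle` and `cycle_cost` are hypothetical, and we assume that a cycle gives one turn to each processor from the initiator up to $P_k$.

```python
def msap_cycle(k, initiator, wants_to_talk):
    """Slots of one cycle: processors initiator..k take their turn in id order.

    A processor that wants to talk uses a T slot; otherwise its turn costs
    one empty tau slot.
    """
    slots = []
    for pid in range(initiator, k + 1):
        kind = "T" if pid in wants_to_talk else "tau"
        slots.append((kind, pid))
    return slots


def cycle_cost(slots, T, tau):
    """Channel time of a cycle, given the two slot lengths."""
    return sum(T if kind == "T" else tau for kind, _ in slots)


cycle = msap_cycle(k=4, initiator=1, wants_to_talk={1, 3})
print(cycle)
print(cycle_cost(cycle, T=100, tau=1))
```

Because every processor can count the slots that have passed, each one knows whose turn it is without any extra control message.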


In Figure 4 we show how the algorithm works with MSAP using the same input as in
Figure 2. Note that we do not need an extra time slot to determine the end of a cycle with this CRP.
The empty slots are of length $\tau$ and are denoted by " "; the full slots are of length $T$.

Figure 4: Example of Merge with MSAP

We next determine the communication complexity of the Merge algorithm with the MSAP CRP.

Theorem 4:

For any instance $I$ of Merge($n$,$k$) with RPABM when CRP is MSAP, the number of $\tau$ slots used,
$\#\tau$, satisfies

$$\frac{n(k-1)}{2} \le \#\tau \le n(k-1) \tag{12}$$

Proof: 

Since the number of cycles is $n$ and in each cycle we can have at most $k-1$ empty $\tau$-slots (when
there is just one transmission, for example), there are at most $n(k-1)$ $\tau$ slots, i.e.:

$$\#\tau \le n(k-1)$$

We now determine the lower bound. An element is added to the merged list when it terminates a
cycle. To determine that a cycle is terminated by an element of processor $P_i$, $k-i$ empty $\tau$-slots must
pass by. Since there are $\frac{n}{k}$ elements in processor $i$ and each terminates a cycle once we have

$$\#\tau \ge \sum_{i=1}^{k} \frac{n}{k}(k-i) = \frac{n(k-1)}{2} \tag{13}$$

□ 

We can conclude that using MSAP we have $\#\tau = \Theta(nk)$. Altogether

$$\mathrm{Comm}(\mathrm{Merge}(n,k),\mathrm{MSAP}) = (2n-1)\cdot T + \Theta(nk)\cdot\tau = \Theta(nk) \tag{14}$$

Note that

$$\frac{\#\tau}{\#T} = O(k) \tag{15}$$

Thus the asymptotic complexity increased by a factor of $k$, which is substantial when $k$ is not a
constant but rather a function of $n$.
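
The two bounds on the number of $\tau$ slots can be evaluated numerically. The sketch below is ours (the name `msap_tau_bounds` is hypothetical) and assumes that $k$ divides $n$, so that each processor holds $n/k$ elements; the lower bound sums $k-i$ empty slots for each element of processor $P_i$, and the upper bound charges $k-1$ empty slots to each of the $n$ cycles.

```python
def msap_tau_bounds(n, k):
    """Return (lower, upper) bounds on the number of tau slots for Merge(n, k) with MSAP."""
    if n % k != 0:
        raise ValueError("this sketch assumes k divides n")
    per_processor = n // k
    # An element of processor P_i that ends a cycle is followed by k - i empty tau slots.
    lower = sum(per_processor * (k - i) for i in range(1, k + 1))
    # At most k - 1 empty tau slots occur in each of the n cycles.
    upper = n * (k - 1)
    return lower, upper


n = 64
for k in (2, 4, 8, 16):
    lower, upper = msap_tau_bounds(n, k)
    # Both bounds grow linearly in k for fixed n.
    print(k, lower, upper)
```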

5.2 Merge with the Tree-Algorithm

To apply the MSAP access scheme to the Merge-algorithm we took advantage of the
structure of the algorithm and its decomposition into cycles. The access scheme resolves conflicts
for each cycle of transmissions and not for each transmission independently.

Our approach in the following scheme is to consider each conflict independently without
acquiring information from previous conflicts or previous steps of the algorithm. Whenever a
conflict occurs, the CRP is invoked and gives the right to transmit to one of the involved
processors. The question we want to address is: given $k$ processors that want to transmit, how many
time slots (for conflict resolution) are needed until one of the processors succeeds in broadcasting? (This
is essentially the well known election problem in distributed computing.) Greenberg [10]
addresses similar questions; however, none of them is identical to our question and therefore none
of his results are the same. For instance, he considers the problem of having $k$ processors that
want to talk in the same slot and provides a probabilistic protocol which on the average enables all
the $k$ messages to be posted in time $O(k)$. In our case, however, the situation changes after each
successful broadcast; namely, if a processor wanted to talk and another processor was chosen to
broadcast, it may not want to talk after it heard the broadcast message.

We apply Capetanakis’ tree algorithm [6] to resolve conflicts using, again, the processor
id’s. The Tree-CRP is described by the recursive procedure Tree($i$,$j$) in Figure 5. Whenever a
processor wants to transmit, the procedure Tree(1,$k$) is invoked, where $k$ is the number of
processors.


Tree(i,j)   /* executed by processor P_x */

    while you want to talk do
        if i <= x <= j then broadcast in the next time slot
        if there is a collision then Tree(i, $\lfloor (i+j)/2 \rfloor$)
        else if the next $\tau$-slot is empty then Tree($\lfloor (i+j)/2 \rfloor + 1$, j)
    endwhile



Figure  5:  The  Tree-CRP 

All processors try to broadcast in the first slot. If there is a collision, only those with id#
less than $\frac{k}{2}$ try to transmit again. Another collision enables only those processors with id# $< \frac{k}{4}$ to
transmit, and so on until a successful transmission or an empty slot occurs. The latter event
activates another subset of processors to keep trying in the same manner. The CRP uses a
sequence of $\tau$-slots which are either collision slots or empty slots, and terminates by a successful
transmission, i.e., by a $T$ slot.


Resolving one conflict using the Tree-Algorithm may require $2\log k$ $\tau$-slots in the worst
case. The scheme can be easily modified to take only $\log k$ $\tau$-slots by noticing that an empty slot
implies a collision in the subsequent slot, which can therefore be skipped.
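
The slot counts above can be checked on a small simulation. The Python sketch below is our own reading of the Tree-CRP, not the authors' code: the function names are hypothetical, ids run from 1 to $k$, a colliding range $[lo, hi]$ is split at $\lfloor (lo+hi)/2 \rfloor$ (the exact split point is an assumption), and the flag `skip_known_collisions` implements the modification that skips the slot known to collide after an empty slot.

```python
from itertools import combinations
from math import ceil, log2


def tree_crp(k, active, skip_known_collisions=False):
    """Resolve one conflict among the ids in `active` (a subset of 1..k).

    Returns (winner, tau_slots); tau_slots lists the "collision" and "empty"
    slots that precede the winner's successful T slot.
    """
    active = set(active)
    if not active or min(active) < 1 or max(active) > k:
        raise ValueError("active must be a non-empty subset of 1..k")
    lo, hi = 1, k        # id range enabled in the next slot
    sibling = None       # upper half set aside by the latest split
    tau_slots = []
    while True:
        contenders = [p for p in active if lo <= p <= hi]
        if len(contenders) == 1:
            return contenders[0], tau_slots
        if contenders:
            # Collision: only the lower half of the range tries again.
            tau_slots.append("collision")
            mid = (lo + hi) // 2
            sibling = (mid + 1, hi)
            hi = mid
        else:
            # Empty slot: every contender sits in the sibling range.
            tau_slots.append("empty")
            lo, hi = sibling
            if skip_known_collisions:
                # The sibling is certain to collide, so split it right away.
                mid = (lo + hi) // 2
                sibling = (mid + 1, hi)
                hi = mid


def worst_case_tau_slots(k, skip_known_collisions=False):
    """Largest number of tau slots over all non-empty subsets (brute force)."""
    ids = range(1, k + 1)
    return max(
        len(tree_crp(k, subset, skip_known_collisions)[1])
        for size in ids
        for subset in combinations(ids, size)
    )


k = 8
print(worst_case_tau_slots(k), "<=", 2 * ceil(log2(k)))
print(worst_case_tau_slots(k, skip_known_collisions=True), "<=", ceil(log2(k)))
```

The brute-force search is exponential in $k$ and is meant only for small examples.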

Theorem  5: 

The number of $\tau$-slots required by the Merge($n$,$k$) algorithm, in its two versions, with RPABM
when the Tree-CRP is used, satisfies $\#\tau \le n \log k$.

Proof: 

The number of broadcasts in the improved algorithm is $n$. In the first version of the algorithm
there are $2n$ broadcasts, but only the first broadcast of each value may be involved in conflicts on
the channel. In subsequent broadcasts the value initiates a cycle, in which case it is the only one to
access the channel. Therefore, for both versions of the Merge algorithm there are $n$ broadcasts
that may be involved in a conflict. Since $\log k$ $\tau$-slots are used for resolving each conflict, the claim
follows.

□ 

We conclude that using the Tree-CRP and the first version of the Merge algorithm the
communication time is:

$$\mathrm{Comm}(\mathrm{Merge}(n,k),\text{Tree-CRP}) = (2n-1)\cdot T + n\log k\cdot\tau = O(n\log k) \tag{16}$$

In this case

$$\frac{\#\tau}{\#T} = O(\log k) \tag{17}$$

We now show that any algorithm for conflict resolution can do no better than the
Tree-CRP. First, we introduce some formalism. Let a Conflict-Resolution-Protocol, CRP, for a set of $k$
processors $\{1,\ldots,k\}$ be a function from the power set of $\{1,\ldots,k\}$ to the set $\{1,\ldots,k\}$. The CRP
determines, for any subset of conflicting processors, one processor that can talk.

The execution of CRP for any subset of processors is over the $\tau$-slotted channel where
each slot is either an empty slot or a conflict slot. The last slot is of length $T$. The sequence of
empty slots and conflict slots up to but not including the $T$ slot could be considered the encoded
information by which CRP selects a specific processor. Note that a processor does not know
which subset is currently being worked on by the CRP but only whether or not it belongs to this
subset. Let $C(S,x)$ be the binary code (i.e., the sequence of empty and conflict slots) which results
when all processors in a subset of processors $S$ want to transmit and $x$ is the first who succeeds.
Two properties are required from any CRP.

1. If $\mathrm{CRP}(S) = x$ where $S$ is a subset of $\{1,2,\ldots,k\}$ then $x \in S$.

2. If $S_1$, $S_2$ are any two subsets such that $x \in S_2$ and if $\mathrm{CRP}(S_1) = x$ and
$\mathrm{CRP}(S_2) = y \ne x$, then $C(S_1,x) \ne C(S_2,y)$.

Let $l(\mathrm{CRP})$ be the maximum code length of a CRP.

Theorem  6: 

For every CRP defined over a set of $k$ elements, $l(\mathrm{CRP}) \ge \log k$.

Proof: 

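For any given CRP we create a special family of subsets of processors

$$\mathrm{CORE}(\mathrm{CRP}) = \{S_1, S_2, \ldots, S_k\}$$

in the following recursive way:

$$S_1 = \{1,\ldots,k\}$$
$$S_{i+1} = S_i - \{x_i\}$$

where

$$x_i = \mathrm{CRP}(S_i)$$

Obviously,

$$S_k \subset S_{k-1} \subset \cdots \subset S_3 \subset S_2 \subset S_1$$

and, for $i \ne j$,

$$\mathrm{CRP}(S_i) \ne \mathrm{CRP}(S_j)$$

We now show that the code sequences for $\mathrm{CRP}(S_i)$ are all different. That is

$$\forall\, i \ne j \quad C(S_i,x_i) \ne C(S_j,x_j)$$

Suppose this is not true and for some $i$, $j$

$$C(S_i,x_i) = C(S_j,x_j)$$

where $j > i$. Then

$$S_j \subset S_i \;\rightarrow\; x_j \in S_i$$

Thus, $x_j \in S_i \cap S_j$ with $\mathrm{CRP}(S_i) = x_i$ and $\mathrm{CRP}(S_j) = x_j$ but $C(S_i,x_i) = C(S_j,x_j)$, which
contradicts property 2 of the CRP. This proves that every CRP must create at least $k$ different binary
codes, which is known to require $\log k$ binary slots.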
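
The construction of CORE(CRP) can be illustrated with a short Python sketch. The toy protocol `scan_crp` below is our own assumption, not one analyzed here: it reserves the $p$-th $\tau$-slot for processor $p$, so the lowest id wins after a run of empty slots. The helper `core_family` (also a hypothetical name) builds the nested subsets and records the code produced for each of them.

```python
def scan_crp(subset, k):
    """Toy CRP: tau-slot p is reserved for id p, so the lowest id wins."""
    winner = min(subset)
    code = ("empty",) * (winner - 1)   # empty slots seen before the winner talks
    return winner, code


def core_family(k, crp):
    """Build CORE(CRP): S_1 = {1..k}, S_{i+1} = S_i - {CRP(S_i)}."""
    subset = frozenset(range(1, k + 1))
    family = []
    while subset:
        winner, code = crp(subset, k)
        family.append((subset, winner, code))
        subset = subset - {winner}
    return family


family = core_family(6, scan_crp)
codes = [code for _, _, code in family]
# The k subsets of the family yield k different winners and k different codes.
print(len(family), len(set(codes)), max(len(code) for code in codes))
```

Any other deterministic CRP can be passed in place of `scan_crp`, provided it returns the winner together with the slot sequence that selected it.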


We see that any deterministic CRP does add a complexity factor of $\log k$ for every
broadcast that enables more than one processor. It can be shown that even when we limit the size of
conflicts to only two out of the $k$ processors the worst case complexity of any CRP is still $\log k$.
The argument goes as follows: in case of a conflict a CRP assigns a subset of the processors to
talk in the first time slot and the others to be silent. If there is exactly one processor talking, the
conflict is resolved and the CRP stops. Otherwise, in case of a collision, a subset of the colliding
processors can be chosen for the second time slot. If there was silence in the first slot, a subset of
the remaining processors will be selected for the second time slot, etc. In the worst case the two
colliding processors can always be in the subset which is larger in each step of division; therefore,
only after $\log k$ steps (there are $k$ processors) will the CRP stop. This argument provides a
different proof of Theorem 6 as well. This suggests approaching the design of broadcast algorithms
in a way that minimizes the number of broadcasts that can result in conflict, or does not allow
conflicts at all.

Conclusion 

In this paper we addressed issues of design and complexity involved in incorporating
"broadcast communication" into distributed algorithms. We presented algorithms for merging $k$
lists of $\frac{n}{k}$ elements each by $k$ processors and proved the complexity to be $O(n)$, regardless of the
number of lists (processors). We also showed that this performance is optimal under the scheme
of one-channel broadcast.

We initially avoided the effect of conflicts which exist in this mode of communication by
introducing the algorithm in an ideal environment in which we pay no penalty for accessing the
channel. We then showed that the problem of accessing the channel adds a factor of at least $\log k$
to the algorithm's performance. This suggests a need to investigate whether a different approach,
i.e., minimizing the number of conflicts while designing the algorithm, might result in a better
total performance. We also note that the use of a single channel limits the performance
considerably (for example, merging cannot be accomplished in less than $n$ sequential time slots), which
motivates the use of more complex configurations of broadcast with more than one channel. For
recent work on broadcast networks with multiple channels see [14, 17].